
LU-4570: Metadata slowdowns on production filesystem at ORNL

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.6.0
    • Affects Version/s: Lustre 2.4.1
    • Labels: None
    • Environment: RHEL 6.4, 2.6.32-358.18.1.el6.x86_64 kernel
    • Severity: 3
    • 12490

    Description

      Clients:
      Some 1.8.6, 1.8.9 clients
      Majority 2.4.X clients

      Server (noted above)

      Users are seeing metadata operation times of several minutes (interactive performance is usually the canary in the coal mine). We are unable to enable DNE due to 1.8 client interactions and the need to keep the data available.

      We run periodic lctl dk with +rpctrace and I can provide information.
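
      For reference, a minimal sketch of such a capture (standard lctl debug controls; the buffer size and output path below are only illustrative):

        lctl set_param debug=+rpctrace            # add +dlmtrace / +nettrace as needed
        lctl set_param debug_mb=1024              # enlarge the debug buffer to reduce overflow
        lctl dk /tmp/lustre-debug.$(date +%s)     # dump the kernel debug log to a file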

      To date we've found one application that was issuing ~30k LDLM_ENQUEUE RPCs in a 30-45s window (lctl dk overflowed, so we only captured 45 of the 60s of trace data). We've put that application on hold, but are still seeing some slowdowns.

      We have some "ls timers" scattered around the compute clusters and have been seeing responses in the 60-80s range for an ls of a pre-determined directory.
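
      The "ls timers" are essentially just timed directory listings, presumably something along the lines of this rough sketch (the directory and log paths are placeholders):

        t0=$(date +%s.%N)
        ls -l /lustre/atlas/probe-dir > /dev/null                            # placeholder directory
        t1=$(date +%s.%N)
        echo "$(date -Is) ls took $(echo "$t1 - $t0" | bc)s" >> /var/tmp/ls-timer.log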

      Questions:

      1. My current guess is that we are not getting any of the metadata improvements in 2.4. Can you confirm?

      2. We are running with options mdt mds_num_threads=1024. We are thinking of changing that to 2048. Recommendations?

      3. What further data would be of interest in tracking this down? I have the raw rpctrace data. Would it be useful to add +nettrace or +dlmtrace and send that along? (I think +dlmtrace, but others may be good too)

      4. If we were to get rid of all the 1.8 clients would the 2.4 metadata improvements be realized without a reboot of the servers? We are waiting on Cray to provide a software stack for some of our login servers that supports 2.4 clients, and are highly blocked on this.

      5. If we cannot get rid of the 1.8 clients, are there any other suggestions for increasing metadata throughput? We are not blocked on host I/O to the MDT device (less than 6 MB/s as recorded by iostat); we have had some past memory pressure on the MDS (we recently reduced ldlm.namespaces.*.lru_size on cluster (non-Cray) clients from the default of unlimited to 128 on compute nodes and 2000 on login nodes; see the sketch after these questions), but in general the MDS host seems responsive and "normal".
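
      For reference, the sketch mentioned in question 5: where these knobs live. The values are the ones already quoted in this ticket, not recommendations, and the exact parameter paths may vary by Lustre version.

        # MDS service thread count is a module option, picked up at module load / mount time:
        #   /etc/modprobe.d/lustre.conf:  options mdt mds_num_threads=1024
        lctl get_param mds.MDS.mdt.threads_max mds.MDS.mdt.threads_started

        # Client-side LDLM lock cache (LRU) size can be changed on the fly:
        lctl set_param ldlm.namespaces.*.lru_size=128     # compute nodes
        lctl set_param ldlm.namespaces.*.lru_size=2000    # login nodes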

      Happy to provide more information, just ask and we'll get it to you.

      Thanks!


      -Jason

      Attachments

        1. atlas-mds1-sysrq-t.tgz
          1.24 MB
        2. ldlm_lockd.c.lustrekernel.txt
          69 kB
        3. rhea-lustre-logs-LU4570.tgz
          102 kB
        4. rpctrace-mds1.bz2
          0.2 kB
        5. titan-lustre-logs-LU4570.tgz
          1.63 MB

        Issue Links

          Activity

            [LU-4570] Metadata slowdowns on production filesystem at ORNL

            jamesanunez James Nunez (Inactive) added a comment -

            Jason,

            The two patches referenced in this ticket have landed: http://review.whamcloud.com/#/c/10601/ and http://review.whamcloud.com/#/c/6339 .

            Are you still experiencing this problem? If not, should we close this ticket?
            green Oleg Drokin added a comment -

            I think you are right and this instance is different from before due to network issues.

            ezell Matt Ezell added a comment -

            We had two "still busy" messages since we last patched, but there were special circumstances:

            Mar  8 09:47:54 atlas-mds1 kernel: [159420.993577] Lustre: atlas1-MDT0000: Client b4ea3026-13aa-b280-3f2e-94a4bab53a72 (at 95@gni100) refused reconnection, still busy with 1 active RPCs
            Mar  8 09:47:55 atlas-mds1 kernel: [159421.712317] Lustre: atlas1-MDT0000: Client 9fc952d9-a299-68de-a80e-21913bc6e90d (at 9311@gni100) refused reconnection, still busy with 1 active RPCs
            

            Around the same time, a Gemini network error on our Cray caused the network to quiesce and re-route.

            2014-03-08T09:47:27.602931-05:00 c24-0c0s0n3 Lustre: 20310:0:(client.c:1533:ptlrpc_expire_one_request()) @@@ Request x1459682730665633 sent from MGC10.36.226.77@o2ib200 to NID 10.36.226.77@o2ib200 17s ago has timed out (17s prior to deadline).
            2014-03-08T09:47:27.602945-05:00 c24-0c0s0n3  req@ffff8803ff9e3000 x1459682730665633/t0 o400->MGS@MGC10.36.226.77@o2ib200_0:26/25 lens 192/384 e 0 to 1 dl 1394290046 ref 2 fl Rpc:N/0/0 rc 0/0
            2014-03-08T09:47:27.602952-05:00 c24-0c0s0n3 Lustre: 20310:0:(client.c:1533:ptlrpc_expire_one_request()) Skipped 19 previous similar messages
            2014-03-08T09:47:27.602961-05:00 c24-0c0s0n3 LustreError: 166-1: MGC10.36.226.77@o2ib200: Connection to service MGS via nid 10.36.226.77@o2ib200 was lost; in progress operations using this service will fail.
            2014-03-08T09:47:27.602970-05:00 c24-0c0s0n3 Lustre: linkfarm-MDT0000-mdc-ffff880407978c00: Connection to service linkfarm-MDT0000 via nid 10.36.226.83@o2ib200 was lost; in progress operations using this service will wait for recovery to complete.
            2014-03-08T09:47:33.651947-05:00 c24-0c0s0n3 LNet: Waiting for hardware quiesce flag to clear
            2014-03-08T09:47:57.909338-05:00 c24-0c0s0n3 LNet: 17880:0:(lib-move.c:1828:lnet_parse_put()) Dropping PUT from 12345-10.36.225.169@o2ib218 portal 4 match 1459682730665920 offset 192 length 192: 2
            2014-03-08T09:47:57.909351-05:00 c24-0c0s0n3 LNet: 17880:0:(lib-move.c:1828:lnet_parse_put()) Skipped 1 previous similar message
            2014-03-08T09:47:57.909358-05:00 c24-0c0s0n3 Lustre: atlas2-MDT0000-mdc-ffff8802ad97c400: Connection restored to service atlas2-MDT0000 using nid 10.36.226.74@o2ib200.
            2014-03-08T09:47:57.909366-05:00 c24-0c0s0n3 LNet: 17880:0:(lib-move.c:1828:lnet_parse_put()) Dropping PUT from 12345-10.36.225.245@o2ib219 portal 4 match 1459682730666861 offset 192 length 192: 2
            2014-03-08T09:47:57.909425-05:00 c24-0c0s0n3 LNet: 17884:0:(lib-move.c:1828:lnet_parse_put()) Dropping PUT from 12345-10.36.225.42@o2ib203 portal 4 match 1459682730665937 offset 192 length 192: 2
            2014-03-08T09:47:57.909432-05:00 c24-0c0s0n3 LNet: 17884:0:(lib-move.c:1828:lnet_parse_put()) Skipped 1516 previous similar messages
            

            I think the clients sent requests, but the responses didn't get back to them before the network timed them out. I'm not sure why it timed out; I thought that during a quiesce situation gnilnd would extend timeouts. When the client timed out the request, it unlinked the MD, so the responses from the MDS were dropped. The MDS didn't get a NACK, so it thinks it's still waiting on that request to complete.

            This is probably a special situation, and likely cannot explain all the occurrences we've seen of this previously.

            green Oleg Drokin added a comment -

            I guess another bit of useful info would be if you could cross-reference the jobs run on the nodes that enter this stuck state and cannot reconnect due to BUSY on the MDT.
            Is it by any chance the same sort of job?
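
            One way to start that cross-reference could be to pull the refused client NIDs out of the MDS syslog and match them against the scheduler's job logs by hand; the log path and the NID-to-node/job mapping are site-specific assumptions:

              grep 'refused reconnection, still busy' /var/log/messages \
                | sed -n 's/.*(at \([^)]*\)).*/\1/p' | sort | uniq -c | sort -rn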

            green Oleg Drokin added a comment -

            Hm, looking at the traces I see 134 threads waiting for object allocations from OSTs.
            355 threads are waiting for locks to be cancelled.

            This is nothing too critical, I imagine.
            There's nothing waiting in readdir, so that angle did not pan out, but all evicted locks are still PR locks pointing at this readdir lock somehow.

            If you can manage to dump a sysrq-t trace from a node that's trying to reconnect and gets refused with 'refused reconnection, still busy with X active RPCs' while this is ongoing, I am interested in getting that.
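
            For reference, a minimal way to capture that on a client (assuming the kernel log buffer is large enough to hold the full task dump; a serial or remote console log is the safer capture path, and the output path here is illustrative):

              echo t > /proc/sysrq-trigger
              dmesg > /var/tmp/sysrq-t.$(hostname).$(date +%s)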


            simmonsja James A Simmons added a comment -

            I uploaded the MDS stack traces to ftp.whamcloud.com/uploads/LU-4570/mds1.20140227.console.log.bz2


            hilljjornl Jason Hill (Inactive) added a comment -

            Console output of echo t > /proc/sysrq-trigger under load.

            green Oleg Drokin added a comment -

            So, I've been digging around the code and it looks like we send close RPCs to the readdir portal now, so if you have heavy close traffic, and close is affected by LU-4008 making it use tons of CPU time, it is quite plausible that all of those service threads are consumed.
            This would probably also mean that you have quite a pile of them present once the problem hits (and it only exacerbates the lock contention on vmalloc).
            Hopefully with LU-4008 we kill two birds with one stone, as general close processing time should improve while also improving readdir processing time, which should free readdir locks much faster.
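
            One way to check whether it is the readpage/close service that is saturated (a sketch; the parameter paths are from memory for a 2.4 MDS and may differ):

              lctl get_param mds.MDS.mdt_readpage.threads_started mds.MDS.mdt_readpage.threads_max
              lctl get_param -n mds.MDS.mdt_readpage.stats | head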

            green Oleg Drokin added a comment -

            Hm, so all the trips of my at_max message seem to top out at 755 seconds.
            This seems to be coming from at_est2timeout(), which can push the effective maximum timeout up to 125% of at_max plus 5 seconds. I guess clients won't really appreciate that certain lock operations can take longer than at_max when such a conflicting lock is encountered.
            But this only happens on OSTs, so it would not explain any timeouts on the MDS.

            In the client logs we see timeouts to both OSTs and MDTs, but sadly the reduced printing only shows the xid, not the opc, so there's no way to know what requests are failing. Interestingly, there's quite a bunch of timeouts in the low range - 17s, 19s, 31s - that cannot be explained by overly long timeouts or anything like that. I wonder why those were not replied to by the server in time, if only with an early reply?
            Timeouts go as high as 829 seconds and more, which interestingly is way higher than at_max (a high network latency setting at play?).
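
            A quick sanity check of that 755s figure, assuming the usual at_est2timeout() padding of val/4 + 5 seconds and the default at_max of 600s:

              awk 'BEGIN { at_max = 600; print at_max + int(at_max / 4) + 5 }'   # prints 755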


            hilljjornl Jason Hill (Inactive) added a comment -

            Oleg, I'm attaching the Lustre logs from atlas and Titan during the time ranges from above (we hit some more this morning around 9:40 also). Curiously, there are no Titan entries for most of that time. I will also look over our smaller Cray XC30 (no Lustre messages during any of these time frames) and our 200-node Linux cluster – those are our three biggest consumers of Lustre – to get their log data from these time frames.

            hilljj@titan-smw0:/tmp$ for i in $(ls titan-lustre-2014-02-1*); do wc -l $i; done
            114989 titan-lustre-2014-02-18-hr10.txt
            0 titan-lustre-2014-02-18-hr15.txt
            8 titan-lustre-2014-02-18-hr16.txt
            0 titan-lustre-2014-02-18-hr17.txt
            45 titan-lustre-2014-02-18-hr18.txt
            0 titan-lustre-2014-02-18-hr21.txt
            0 titan-lustre-2014-02-18-hr22.txt
            0 titan-lustre-2014-02-18-hr23.txt
            0 titan-lustre-2014-02-19-hr00.txt
            0 titan-lustre-2014-02-19-hr01.txt
            4 titan-lustre-2014-02-19-hr09.txt
            hilljj@titan-smw0:/tmp$ md5sum titan-lustre-logs-LU4570.tgz
            79571cb30d28a42e969461eeed947279 titan-lustre-logs-LU4570.tgz

            [root@rhea-mgmt2 tmp]# wc -l rhea-lustre-2014-02-1*
            0 rhea-lustre-2014-02-18-hr10.txt
            714 rhea-lustre-2014-02-18-hr15.txt
            1347 rhea-lustre-2014-02-18-hr16.txt
            1179 rhea-lustre-2014-02-18-hr17.txt
            1605 rhea-lustre-2014-02-18-hr18.txt
            318 rhea-lustre-2014-02-18-hr21.txt
            337 rhea-lustre-2014-02-18-hr22.txt
            364 rhea-lustre-2014-02-18-hr23.txt
            293 rhea-lustre-2014-02-19-hr00.txt
            200 rhea-lustre-2014-02-19-hr01.txt
            787 rhea-lustre-2014-02-19-hr09.txt
            7144 total

            [root@rhea-mgmt2 tmp]# md5sum rhea-lustre-logs-LU4570.tgz
            cfdf017d0d5b4b7ef813b74f21f687cb rhea-lustre-logs-LU4570.tgz

            hilljj@hill-desktop:~$ wc -l ccs_homedir/ldlm_lockd.c.lustrekernel.txt
            397 ccs_homedir/ldlm_lockd.c.lustrekernel.txt

            simmonsja James A Simmons added a comment - edited

            I was just discussing that on LU-4008. I have some data showing exactly what is causing the vmalloc lock contention. The only drawback to that approach is memory fragmentation.


            People

              Assignee: green Oleg Drokin
              Reporter: hilljjornl Jason Hill (Inactive)
              Votes: 0
              Watchers: 25

              Dates

                Created:
                Updated:
                Resolved: