
LU-4570: Metadata slowdowns on production filesystem at ORNL

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.6.0
    • Affects Version/s: Lustre 2.4.1
    • Labels: None
    • Environment: RHEL 6.4, 2.6.32-358.18.1.el6.x86_64 kernel
    • Severity: 3
    • 12490

    Description

      Clients:
      Some 1.8.6, 1.8.9 clients
      Majority 2.4.X clients

      Server (noted above)

      Users are seeing metadata operation times of several minutes (interactive performance is usually the canary in the coal mine). We are unable to enable DNE due to 1.8 client interactions and the need to keep the data available.

      We run periodic lctl dk with +rpctrace and I can provide information.
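
      For reference, a minimal sketch of such a capture (standard lctl debug controls; the buffer size and output path below are only illustrative):

        lctl set_param debug=+rpctrace            # add +dlmtrace / +nettrace as needed
        lctl set_param debug_mb=1024              # enlarge the debug buffer to reduce overflow
        lctl dk /tmp/lustre-debug.$(date +%s)     # dump the kernel debug log to a file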

      To date we've found one application that was issuing ~30k LDLM_ENQUEUE RPCs in a 30-45s window (lctl dk overflowed, so we only captured 45 of the 60s of trace data). We've put that application on hold, but are still seeing some slowdowns.

      We have some "ls timers" scattered around the compute clusters and have been seeing responses in the 60-80s range for an ls of a pre-determined directory.
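
      The "ls timers" are essentially just timed directory listings, presumably something along the lines of this rough sketch (the directory and log paths are placeholders):

        t0=$(date +%s.%N)
        ls -l /lustre/atlas/probe-dir > /dev/null                            # placeholder directory
        t1=$(date +%s.%N)
        echo "$(date -Is) ls took $(echo "$t1 - $t0" | bc)s" >> /var/tmp/ls-timer.log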

      Questions:

      1. My current guess is that we are not getting any of the metadata improvements in 2.4. Can you confirm?

      2. We are running with options mdt mds_num_threads=1024. We are thinking of changing that to 2048. Recommendations?

      3. What further data would be of interest in tracking this down? I have the raw rpctrace data. Would it be useful to add +nettrace or +dlmtrace and send that along? (I think +dlmtrace, but others may be good too)

      4. If we were to get rid of all the 1.8 clients would the 2.4 metadata improvements be realized without a reboot of the servers? We are waiting on Cray to provide a software stack for some of our login servers that supports 2.4 clients, and are highly blocked on this.

      5. If we cannot get rid of the 1.8 clients, are there any other suggestions for increasing metadata throughput? We are not blocked on host I/O to the MDT device (less than 6 MB/s as recorded by iostat); we have had some past memory pressure on the MDS (we recently reduced ldlm.namespaces.*.lru_size on cluster (non-Cray) clients from the default of unlimited to 128 on compute nodes and 2000 on login nodes; see the sketch after these questions), but in general the MDS host seems responsive and "normal".
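
      For reference, the sketch mentioned in question 5: where these knobs live. The values are the ones already quoted in this ticket, not recommendations, and the exact parameter paths may vary by Lustre version.

        # MDS service thread count is a module option, picked up at module load / mount time:
        #   /etc/modprobe.d/lustre.conf:  options mdt mds_num_threads=1024
        lctl get_param mds.MDS.mdt.threads_max mds.MDS.mdt.threads_started

        # Client-side LDLM lock cache (LRU) size can be changed on the fly:
        lctl set_param ldlm.namespaces.*.lru_size=128     # compute nodes
        lctl set_param ldlm.namespaces.*.lru_size=2000    # login nodes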

      Happy to provide more information, just ask and we'll get it to you.

      Thanks!


      -Jason

      Attachments

        1. atlas-mds1-sysrq-t.tgz
          1.24 MB
        2. ldlm_lockd.c.lustrekernel.txt
          69 kB
        3. rhea-lustre-logs-LU4570.tgz
          102 kB
        4. rpctrace-mds1.bz2
          0.2 kB
        5. titan-lustre-logs-LU4570.tgz
          1.63 MB

        Issue Links

          Activity

            [LU-4570] Metadata slowdowns on production filesystem at ORNL

            jamesanunez James Nunez (Inactive) added a comment -

            Jason,

            The two patches referenced in this ticket have landed: http://review.whamcloud.com/#/c/10601/ and http://review.whamcloud.com/#/c/6339 .

            Are you still experiencing this problem? If not, should we close this ticket?
            green Oleg Drokin added a comment -

            I think you are right and this instance is different from before due to network issues.

            ezell Matt Ezell added a comment -

            We had two "still busy" messages since we last patched, but there were special circumstances:

            Mar  8 09:47:54 atlas-mds1 kernel: [159420.993577] Lustre: atlas1-MDT0000: Client b4ea3026-13aa-b280-3f2e-94a4bab53a72 (at 95@gni100) refused reconnection, still busy with 1 active RPCs
            Mar  8 09:47:55 atlas-mds1 kernel: [159421.712317] Lustre: atlas1-MDT0000: Client 9fc952d9-a299-68de-a80e-21913bc6e90d (at 9311@gni100) refused reconnection, still busy with 1 active RPCs
            

            Around the same time, a Gemini network error on our Cray caused the network to quiesce and re-route.

            2014-03-08T09:47:27.602931-05:00 c24-0c0s0n3 Lustre: 20310:0:(client.c:1533:ptlrpc_expire_one_request()) @@@ Request x1459682730665633 sent from MGC10.36.226.77@o2ib200 to NID 10.36.226.77@o2ib200 17s ago has timed out (17s prior to deadline).
            2014-03-08T09:47:27.602945-05:00 c24-0c0s0n3  req@ffff8803ff9e3000 x1459682730665633/t0 o400->MGS@MGC10.36.226.77@o2ib200_0:26/25 lens 192/384 e 0 to 1 dl 1394290046 ref 2 fl Rpc:N/0/0 rc 0/0
            2014-03-08T09:47:27.602952-05:00 c24-0c0s0n3 Lustre: 20310:0:(client.c:1533:ptlrpc_expire_one_request()) Skipped 19 previous similar messages
            2014-03-08T09:47:27.602961-05:00 c24-0c0s0n3 LustreError: 166-1: MGC10.36.226.77@o2ib200: Connection to service MGS via nid 10.36.226.77@o2ib200 was lost; in progress operations using this service will fail.
            2014-03-08T09:47:27.602970-05:00 c24-0c0s0n3 Lustre: linkfarm-MDT0000-mdc-ffff880407978c00: Connection to service linkfarm-MDT0000 via nid 10.36.226.83@o2ib200 was lost; in progress operations using this service will wait for recovery to complete.
            2014-03-08T09:47:33.651947-05:00 c24-0c0s0n3 LNet: Waiting for hardware quiesce flag to clear
            2014-03-08T09:47:57.909338-05:00 c24-0c0s0n3 LNet: 17880:0:(lib-move.c:1828:lnet_parse_put()) Dropping PUT from 12345-10.36.225.169@o2ib218 portal 4 match 1459682730665920 offset 192 length 192: 2
            2014-03-08T09:47:57.909351-05:00 c24-0c0s0n3 LNet: 17880:0:(lib-move.c:1828:lnet_parse_put()) Skipped 1 previous similar message
            2014-03-08T09:47:57.909358-05:00 c24-0c0s0n3 Lustre: atlas2-MDT0000-mdc-ffff8802ad97c400: Connection restored to service atlas2-MDT0000 using nid 10.36.226.74@o2ib200.
            2014-03-08T09:47:57.909366-05:00 c24-0c0s0n3 LNet: 17880:0:(lib-move.c:1828:lnet_parse_put()) Dropping PUT from 12345-10.36.225.245@o2ib219 portal 4 match 1459682730666861 offset 192 length 192: 2
            2014-03-08T09:47:57.909425-05:00 c24-0c0s0n3 LNet: 17884:0:(lib-move.c:1828:lnet_parse_put()) Dropping PUT from 12345-10.36.225.42@o2ib203 portal 4 match 1459682730665937 offset 192 length 192: 2
            2014-03-08T09:47:57.909432-05:00 c24-0c0s0n3 LNet: 17884:0:(lib-move.c:1828:lnet_parse_put()) Skipped 1516 previous similar messages
            

            I think the clients sent requests, but the responses didn't get back to them before the network timed them out. I'm not sure why it timed out; I thought that during a quiesce situation gnilnd would extend timeouts. When the client timed out the request, it unlinked the MD, so the responses from the MDS were dropped. The MDS didn't get a NACK, so it thinks it's still waiting on that request to complete.

            This is probably a special situation, and likely cannot explain all the occurrences we've seen of this previously.

            green Oleg Drokin added a comment -

            I guess another bit of useful info would be if you could cross-reference the jobs run on the nodes that enter this stuck state and cannot reconnect due to BUSY on the MDT.
            Is it by any chance the same sort of job?
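
            One way to start that cross-reference could be to pull the refused client NIDs out of the MDS syslog and match them against the scheduler's job logs by hand; the log path and the NID-to-node/job mapping are site-specific assumptions:

              grep 'refused reconnection, still busy' /var/log/messages \
                | sed -n 's/.*(at \([^)]*\)).*/\1/p' | sort | uniq -c | sort -rn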

            green Oleg Drokin added a comment -

            Hm, looking at the traces I see 134 threads waiting for object allocations from OSTs.
            355 threads are waiting for locks to be cancelled.

            This is nothing too critical, I imagine.
            There's nothing waiting in readdir, so that angle did not pan out, but all evicted locks are still PR locks pointing at this readdir lock somehow.

            If you can manage to dump a sysrq-t trace from a node that's trying to reconnect and gets refused with 'refused reconnection, still busy with X active RPCs' while this is ongoing, I am interested in getting that.
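
            For reference, a minimal way to capture that on a client (assuming the kernel log buffer is large enough to hold the full task dump; a serial or remote console log is the safer capture path, and the output path here is illustrative):

              echo t > /proc/sysrq-trigger
              dmesg > /var/tmp/sysrq-t.$(hostname).$(date +%s)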


            simmonsja James A Simmons added a comment -

            I uploaded the MDS stack traces to ftp.whamcloud.com/uploads/LU-4570/mds1.20140227.console.log.bz2


            hilljjornl Jason Hill (Inactive) added a comment -

            Console output of echo t > /proc/sysrq-trigger under load.

            green Oleg Drokin added a comment -

            So, I've been digging around the code and it looks like we send close RPCs to the readdir portal now, so if you have heavy close traffic, and close is affected by LU-4008 making it use tons of CPU time, it is quite plausible that all of those service threads are consumed.
            This would probably also mean that you have quite a pile of them present once the problem hits (and it only exacerbates the lock contention on vmalloc).
            Hopefully with LU-4008 we kill two birds with one stone, as general close processing time should improve while also improving readdir processing time, which should free readdir locks much faster.
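
            One way to check whether it is the readpage/close service that is saturated (a sketch; the parameter paths are from memory for a 2.4 MDS and may differ):

              lctl get_param mds.MDS.mdt_readpage.threads_started mds.MDS.mdt_readpage.threads_max
              lctl get_param -n mds.MDS.mdt_readpage.stats | head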

            green Oleg Drokin added a comment -

            Hm, so all the trips of my at_max message seem to top out at 755 seconds.
            This seems to be coming from at_est2timeout(), which can push the effective maximum timeout up to 125% of at_max plus 5 seconds. I guess clients won't really appreciate that certain lock operations can take longer than at_max when such a conflicting lock is encountered.
            But this only happens on OSTs, so it would not explain any timeouts on the MDS.

            In the client logs we see timeouts to both OSTs and MDTs, but sadly the reduced printing only shows the xid, not the opc, so there's no way to know what requests are failing. Interestingly, there's quite a bunch of timeouts in the low range - 17s, 19s, 31s - that cannot be explained by overly long timeouts or anything like that. I wonder why those were not replied to by the server in time, if only with an early reply?
            Timeouts go as high as 829 seconds and more, which interestingly is way higher than at_max (a high network latency setting at play?).
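
            A quick sanity check of that 755s figure, assuming the usual at_est2timeout() padding of val/4 + 5 seconds and the default at_max of 600s:

              awk 'BEGIN { at_max = 600; print at_max + int(at_max / 4) + 5 }'   # prints 755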


            hilljjornl Jason Hill (Inactive) added a comment -

            Oleg, I'm attaching the Lustre logs from atlas and Titan during the time ranges from above (we hit some more this morning around 9:40 also). Curiously, there are no Titan entries for most of that time. I will also look over our smaller Cray XC30 (no Lustre messages during any of these time frames) and our 200-node Linux cluster – those are our three biggest consumers of Lustre – to get their log data from these time frames.

            hilljj@titan-smw0:/tmp$ for i in $(ls titan-lustre-2014-02-1*); do wc -l $i; done
            114989 titan-lustre-2014-02-18-hr10.txt
            0 titan-lustre-2014-02-18-hr15.txt
            8 titan-lustre-2014-02-18-hr16.txt
            0 titan-lustre-2014-02-18-hr17.txt
            45 titan-lustre-2014-02-18-hr18.txt
            0 titan-lustre-2014-02-18-hr21.txt
            0 titan-lustre-2014-02-18-hr22.txt
            0 titan-lustre-2014-02-18-hr23.txt
            0 titan-lustre-2014-02-19-hr00.txt
            0 titan-lustre-2014-02-19-hr01.txt
            4 titan-lustre-2014-02-19-hr09.txt
            hilljj@titan-smw0:/tmp$ md5sum titan-lustre-logs-LU4570.tgz
            79571cb30d28a42e969461eeed947279 titan-lustre-logs-LU4570.tgz

            [root@rhea-mgmt2 tmp]# wc -l rhea-lustre-2014-02-1*
            0 rhea-lustre-2014-02-18-hr10.txt
            714 rhea-lustre-2014-02-18-hr15.txt
            1347 rhea-lustre-2014-02-18-hr16.txt
            1179 rhea-lustre-2014-02-18-hr17.txt
            1605 rhea-lustre-2014-02-18-hr18.txt
            318 rhea-lustre-2014-02-18-hr21.txt
            337 rhea-lustre-2014-02-18-hr22.txt
            364 rhea-lustre-2014-02-18-hr23.txt
            293 rhea-lustre-2014-02-19-hr00.txt
            200 rhea-lustre-2014-02-19-hr01.txt
            787 rhea-lustre-2014-02-19-hr09.txt
            7144 total

            [root@rhea-mgmt2 tmp]# md5sum rhea-lustre-logs-LU4570.tgz
            cfdf017d0d5b4b7ef813b74f21f687cb rhea-lustre-logs-LU4570.tgz

            hilljj@hill-desktop:~$ wc -l ccs_homedir/ldlm_lockd.c.lustrekernel.txt
            397 ccs_homedir/ldlm_lockd.c.lustrekernel.txt

            simmonsja James A Simmons added a comment - edited

            I was just discussing that on LU-4008. I have some data showing exactly what is causing the vmalloc lock contention. The only drawback to that approach is memory fragmentation.


            People

              Assignee: green Oleg Drokin
              Reporter: hilljjornl Jason Hill (Inactive)
              Votes: 0
              Watchers: 25

              Dates

                Created:
                Updated:
                Resolved: