LU-4570: Metadata slowdowns on production filesystem at ORNL

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.6.0
    • Affects Version/s: Lustre 2.4.1
    • None
    • Environment: RHEL 6.4, 2.6.32-358.18.1.el6.x86_64 kernel
    • Severity: 3
    • 12490

    Description

      Clients:
      Some 1.8.6, 1.8.9 clients
      Majority 2.4.X clients

      Server (noted above)

      Users are seeing metadata operation times of several minutes (interactive performance is usually the canary in the coal mine). We are unable to enable DNE due to 1.8 client interactions and the need to keep the data available.

      We run periodic lctl dk with +rpctrace and can provide that information.

      To date we've found one application that was issuing ~30k LDLM_ENQUEUE RPCs in a 30-45s window (lctl dk overflowed, so we only got 45 of the 60s of trace data). We've put that application on hold, but are still seeing some slowdowns.
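      A rough way to tally those enqueues in an rpctrace dump, assuming the dump uses the usual "o101" opcode notation for LDLM_ENQUEUE (the dump path below is a placeholder):

        # Rough tally of LDLM_ENQUEUE (opcode 101) requests in an rpctrace dump;
        # assumes the dump contains the usual "o101->" request notation.
        grep -c 'o101->' /path/to/rpctrace.dump   # placeholder path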

      We have some "ls timers" scattered around the compute clusters and have been seeing responses in the 60-80s range for an ls of a pre-determined directory.
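      A minimal sketch of what one of those timers might look like; the directory, log file, and interval below are placeholders rather than the production values:

        #!/bin/bash
        # Hypothetical "ls timer": time a listing of a pre-determined directory
        # every 5 minutes and append the result to a log.
        DIR=/lustre/atlas1/some/predetermined/dir   # placeholder path
        LOG=/var/log/ls-timer.log                   # placeholder log file
        while true; do
            t0=$(date +%s)
            ls "$DIR" > /dev/null
            t1=$(date +%s)
            echo "$(date -Is) ls of $DIR took $((t1 - t0))s" >> "$LOG"
            sleep 300
        done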

      Questions:

      1. My current guess is that we are not getting any of the metadata improvements in 2.4. Can you confirm?

      2. We are running with options mdt mds_num_threads=1024. We are thinking of changing that to 2048. Recommendations? (See the sketch after question 5.)

      3. What further data would be of interest in tracking this down? I have the raw rpctrace data. Would it be useful to add +nettrace or +dlmtrace and send that along? (I think +dlmtrace, but others may be good too; see the sketch after question 5.)

      4. If we were to get rid of all the 1.8 clients would the 2.4 metadata improvements be realized without a reboot of the servers? We are waiting on Cray to provide a software stack for some of our login servers that supports 2.4 clients, and are highly blocked on this.

      5. If we cannot get rid of the 1.8 clients, are there any other suggestions for increasing metadata throughput? We are not blocked on host I/O to the MDT device (less than 6 MB/s as recorded by iostat); we have had some past memory pressure on the MDS (we recently reduced ldlm.namespaces.*.lru_size on the cluster (non-Cray) clients from the default of unlimited to 128 on compute nodes and 2000 on login nodes; see the sketch below), but in general the MDS host seems responsive and "normal".
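      For reference, rough sketches of the knobs touched on in questions 2, 3, and 5; the modprobe file name, debug buffer size, and dump path are assumptions, not what is actually deployed:

        # Question 2: mds_num_threads is a module option on the MDS, e.g. in
        # /etc/modprobe.d/lustre.conf (file name is an assumption; it varies per site):
        #   options mdt mds_num_threads=2048

        # Question 3: add DLM (and optionally net) tracing to the current debug
        # mask, and enlarge the debug buffer so lctl dk is less likely to overflow:
        lctl set_param debug=+dlmtrace
        lctl set_param debug=+nettrace
        lctl set_param debug_mb=1024
        lctl dk /tmp/lustre-debug.$(date +%s)

        # Question 5: the client-side LRU reduction mentioned above, per node type:
        lctl set_param ldlm.namespaces.*.lru_size=128    # non-Cray compute clients
        lctl set_param ldlm.namespaces.*.lru_size=2000   # login nodes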

      Happy to provide more information, just ask and we'll get it to you.

      Thanks!


      -Jason

      Attachments

        1. atlas-mds1-sysrq-t.tgz
          1.24 MB
        2. ldlm_lockd.c.lustrekernel.txt
          69 kB
        3. rhea-lustre-logs-LU4570.tgz
          102 kB
        4. rpctrace-mds1.bz2
          0.2 kB
        5. titan-lustre-logs-LU4570.tgz
          1.63 MB

        Issue Links

          Activity

            [LU-4570] Metadata slowdowns on production filesystem at ORNL

            jamesanunez James Nunez (Inactive) added a comment -

            We will track the b2_5 patches separately.

            simmonsja James A Simmons added a comment -

            As a note, I back-ported both of the above patches to 2.5. They are at:

            http://review.whamcloud.com/#/c/10962
            http://review.whamcloud.com/#/c/10997

            hilljjornl Jason Hill (Inactive) added a comment -

            James:

            Yes, let's go ahead and resolve this issue.

            -Jason

            jamesanunez James Nunez (Inactive) added a comment -

            Jason,

            The two patches referenced in this ticket have landed: http://review.whamcloud.com/#/c/10601/ and http://review.whamcloud.com/#/c/6339 .

            Are you still experiencing this problem? If not, should we close this ticket?
            green Oleg Drokin added a comment -

            I think you are right and this instance is different from before due to network issues.

            ezell Matt Ezell added a comment -

            We had two "still busy" messages since we last patched, but there were special circumstances:

            Mar  8 09:47:54 atlas-mds1 kernel: [159420.993577] Lustre: atlas1-MDT0000: Client b4ea3026-13aa-b280-3f2e-94a4bab53a72 (at 95@gni100) refused reconnection, still busy with 1 active RPCs
            Mar  8 09:47:55 atlas-mds1 kernel: [159421.712317] Lustre: atlas1-MDT0000: Client 9fc952d9-a299-68de-a80e-21913bc6e90d (at 9311@gni100) refused reconnection, still busy with 1 active RPCs
            

            Around the same time, a Gemini network error on our Cray caused the network to quiesce and re-route.

            2014-03-08T09:47:27.602931-05:00 c24-0c0s0n3 Lustre: 20310:0:(client.c:1533:ptlrpc_expire_one_request()) @@@ Request x1459682730665633 sent from MGC10.36.226.77@o2ib200 to NID 10.36.226.77@o2ib200 17s ago has timed out (17s prior to deadline).
            2014-03-08T09:47:27.602945-05:00 c24-0c0s0n3  req@ffff8803ff9e3000 x1459682730665633/t0 o400->MGS@MGC10.36.226.77@o2ib200_0:26/25 lens 192/384 e 0 to 1 dl 1394290046 ref 2 fl Rpc:N/0/0 rc 0/0
            2014-03-08T09:47:27.602952-05:00 c24-0c0s0n3 Lustre: 20310:0:(client.c:1533:ptlrpc_expire_one_request()) Skipped 19 previous similar messages
            2014-03-08T09:47:27.602961-05:00 c24-0c0s0n3 LustreError: 166-1: MGC10.36.226.77@o2ib200: Connection to service MGS via nid 10.36.226.77@o2ib200 was lost; in progress operations using this service will fail.
            2014-03-08T09:47:27.602970-05:00 c24-0c0s0n3 Lustre: linkfarm-MDT0000-mdc-ffff880407978c00: Connection to service linkfarm-MDT0000 via nid 10.36.226.83@o2ib200 was lost; in progress operations using this service will wait for recovery to complete.
            2014-03-08T09:47:33.651947-05:00 c24-0c0s0n3 LNet: Waiting for hardware quiesce flag to clear
            2014-03-08T09:47:57.909338-05:00 c24-0c0s0n3 LNet: 17880:0:(lib-move.c:1828:lnet_parse_put()) Dropping PUT from 12345-10.36.225.169@o2ib218 portal 4 match 1459682730665920 offset 192 length 192: 2
            2014-03-08T09:47:57.909351-05:00 c24-0c0s0n3 LNet: 17880:0:(lib-move.c:1828:lnet_parse_put()) Skipped 1 previous similar message
            2014-03-08T09:47:57.909358-05:00 c24-0c0s0n3 Lustre: atlas2-MDT0000-mdc-ffff8802ad97c400: Connection restored to service atlas2-MDT0000 using nid 10.36.226.74@o2ib200.
            2014-03-08T09:47:57.909366-05:00 c24-0c0s0n3 LNet: 17880:0:(lib-move.c:1828:lnet_parse_put()) Dropping PUT from 12345-10.36.225.245@o2ib219 portal 4 match 1459682730666861 offset 192 length 192: 2
            2014-03-08T09:47:57.909425-05:00 c24-0c0s0n3 LNet: 17884:0:(lib-move.c:1828:lnet_parse_put()) Dropping PUT from 12345-10.36.225.42@o2ib203 portal 4 match 1459682730665937 offset 192 length 192: 2
            2014-03-08T09:47:57.909432-05:00 c24-0c0s0n3 LNet: 17884:0:(lib-move.c:1828:lnet_parse_put()) Skipped 1516 previous similar messages
            

            I think the clients sent requests, but the responses didn't get back to them before the network timed the requests out. I'm not sure why they timed out; I thought gnilnd would extend timeouts during a quiesce situation. When a client timed out a request, it unlinked the MD, so the responses from the MDS were dropped. The MDS didn't get a NACK, so it thinks it's still waiting for that request to complete.

            This is probably a special situation, and likely cannot explain all the occurrences we've seen of this previously.

            green Oleg Drokin added a comment -

            I guess another bit of useful info would be if you could cross-reference the jobs run on the nodes that enter this stuck state and cannot reconnect due to BUSY on the MDT.
            Is it by any chance the same sort of job?

            green Oleg Drokin added a comment -

            Hm, looking at the traces I see 134 threads waiting for object allocations from OSTs.
            355 threads are waiting for locks to be cancelled.

            This is nothing too critical, I imagine.
            There's nothing waiting in readdir, so that angle did not pan out, but all the evicted locks are still PR locks pointing at this readdir lock somehow.

            If you can manage to dump a sysrq-t trace from a node that's trying to reconnect and gets refused with 'refused reconnection, still busy with X active RPCs' while this is ongoing, I am interested in getting that.
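            Something along these lines should work for capturing that on the client, assuming sysrq is enabled and the kernel ring buffer is large enough to hold the full dump (the output path is just an example):

              echo 1 > /proc/sys/kernel/sysrq    # make sure sysrq is enabled
              echo t > /proc/sysrq-trigger       # dump all task stacks to the kernel log
              dmesg > /tmp/client-sysrq-t.txt    # or capture from the serial console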


            simmonsja James A Simmons added a comment -

            I uploaded the MDS stack traces to ftp.whamcloud.com/uploads/LU-4570/mds1.20140227.console.log.bz2

            hilljjornl Jason Hill (Inactive) added a comment -

            Console output of echo t > /proc/sysrq-trigger under load.

            People

              Assignee: green Oleg Drokin
              Reporter: hilljjornl Jason Hill (Inactive)
              Votes: 0
              Watchers: 25
