LU-1504: the /lustre filesystem was unusable for an extended period due to a single OST dropping out of service

Details

    • Type: Task
    • Resolution: Fixed
    • Priority: Major
    • None
    • Affects Version/s: Lustre 2.0.0, Lustre 1.8.6
    • Clustering
    • 4043

    Description

      Hello Support,

      One of our customers, at the University of Delaware, has had at least three separate instances where the /lustre filesystem was unusable for an extended period because a single OST dropped out of service with:

      Jun 11 02:40:07 oss4 kernel: Lustre: 9443:0:(ldlm_lib.c:874:target_handle_connect()) lustre-OST0016: refuse reconnection from d085b4f1-e418-031f-8474-b980894ce7ad@10.55.50.115@o2ib to 0xffff8103119bac00; still busy with 1 active RPCs

      The hang was so bad for one of them (upwards of 30 minutes with the OST unavailable) that a reboot of the oss1/oss2 pair was necessary. The symptom is easily identified: long hangs on the head node while one waits for a directory listing or for a file to open for editing in vi, etc. Sometimes the situation remedies itself, sometimes it does not and we need to reboot one or more OSS nodes.
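
      For reference, a minimal way to check from a client which OSTs it can currently reach (a hedged sketch, not part of the original report; assumes root on a Lustre 1.8 client, and the "headnode" prompt is only illustrative):

      # Ping every OST this client knows about; inactive targets are flagged
      [root@headnode ~]# lfs check osts
      # List the local Lustre devices and their current state
      [root@headnode ~]# lctl dl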

      "Enclosed are all of the syslogs, dmesg, and /tmp/lustre* crash dumps for our MDS and OSS's."

      You can retrieve the drop-off anytime in the next 21 days by clicking the following link (or copying and pasting it into your web browser):

      "https://pandora.nss.udel.edu//pickup.php?claimID=vuAFoSBUoReVuaje&claimPasscode=RfTmXJZFVdUGzbLk&emailAddr=tsingh%40penguincomputing.com"

      Full information for the drop-off:

      Claim ID: vuAFoSBUoReVuaje
      Claim Passcode: RfTmXJZFVdUGzbLk
      Date of Drop-Off: 2012-06-11 12:23:20-0400

      Please review the attached log files and advise us on the next course of action, since this is a very critical issue and is impacting their environment. Also, please let me know if you need any further information.

      Thanks
      Terry
      Penguin Tech Support
      Ph: 415-954-2833

      Attachments

        1. headnode-messages.gz
          623 kB
        2. lustre-failure-120619-1.gz
          21 kB
        3. mds0a-messages.gz
          5 kB
        4. oss3-vmstat.log
          9 kB
        5. oss4-messages.gz
          52 kB


          Activity

            green Oleg Drokin added a comment -

            I believe your problem does not look like LU-416, which was a client-side problem (and one specific to 2.x releases only, at that).
            You are clearly having a problem on the server side.
            It looks like your disk devices are overloaded with small IO, and there is very little Lustre can do about it, because apparently this is what the application in question wants.


            frey@udel.edu Jeffrey Frey added a comment -

            Cliff:

            Further research on your Jira site has shown that the issue we're seeing is EXACTLY the situation reported in LU-416:

            http://jira.whamcloud.com/browse/LU-416

            To what version of Lustre does that incident correspond – 1.8.6? – and is it resolved in 1.8.8?

            ::::::::::::::::::::::::::::::::::::::::::::::::::::::
            Jeffrey T. Frey, Ph.D.
            Systems Programmer IV / Cluster Management
            Network & Systems Services / College of Engineering
            University of Delaware, Newark DE 19716
            Office: (302) 831-6034 Mobile: (302) 419-4976
            http://turin.nss.udel.edu/
            ::::::::::::::::::::::::::::::::::::::::::::::::::::::

            millerbe@udel.edu Ben Miller added a comment -

            We continue to have temporary OST unavailability with the read cache disabled. We haven't had a chance to upgrade to 1.8.8 yet. Sometimes the OST will recover within a few minutes; other times, after several minutes (or hours), we end up rebooting an OSS or two to get Lustre available again.

            Ben


            cliffw Cliff White (Inactive) added a comment -

            Please update us as to your status. What else can we do to help on this issue? Should we close this bug?


            cliffw Cliff White (Inactive) added a comment -

            What is your current state? What else can we do to assist?


            cliffw Cliff White (Inactive) added a comment -

            We have never had data loss from a rolling upgrade, or from a point upgrade of this type. We routinely test these point-release upgrades/downgrades as part of the release process for each new Lustre release.

            If there are other dependencies involved (such as QLogic tools) then it would be advisable to involve Penguin in the decision-making process about how to proceed.

            While upgrading your Lustre version provides the benefit of the other fixes it includes, it is quite possible to apply the exact fix you need to your existing version of Lustre, which may be less disruptive to other layers of the stack.

            Our clients use a stock RedHat kernel, and you can certainly do a gradual upgrade of the clients as your schedule permits; there is no problem running with a mix of client versions.

            Before embarking on such an upgrade path you could even experiment with a single client to see whether the pay-off warrants the effort. To do this effectively you would need to identify an example of the kind of workload that triggers this issue so that you can measure the before-and-after effect of applying this fix. Do you have any idea as to a likely candidate for such a test?
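
            As an illustration of the kind of before-and-after measurement suggested above, here is a minimal small-file timing loop (a sketch only; the /lustre/iotest path, the file count, and the "headnode" prompt are illustrative and not from this ticket). It roughly mimics the 4k, sync-heavy write pattern described in the IO-statistics analysis:

            # Create 1000 single-4k-block files with an fsync per file, and time it
            [root@headnode ~]# mkdir -p /lustre/iotest
            [root@headnode ~]# time sh -c 'for i in $(seq 1 1000); do dd if=/dev/zero of=/lustre/iotest/f$i bs=4k count=1 conv=fsync 2>/dev/null; done'
            # Time the metadata side as well (listing the newly created files)
            [root@headnode ~]# time ls -l /lustre/iotest > /dev/null

            Running the same loop before and after a change (read cache off, 1.8.8 upgrade, etc.) gives a rough but repeatable comparison.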

            frey@udel.edu Jeffrey Frey added a comment -

            QLogic's OFED stack is a superset of the stock OFED distribution. It includes value-added utilities/libraries and bug-fixes [that OFED hypothetically rolls back into the baseline distribution].

            When we initially set up the head node with ScientificLinux 6.1, the stock RDMA kernel module kept the machine from booting (kernel panic every time). Wiping the RHEL IB stack and replacing it with the QLogic one for RHEL 6.1 fixed the issue.

            As configured by Penguin, the LUSTRE servers are all running CentOS 5.x.

            [root@oss1 ~]# ibstat
            CA 'qib0'
            CA type: InfiniPath_QLE7340
            Number of ports: 1
            Firmware version:
            Hardware version: 2
            Node GUID: 0x001175000077592a
            System image GUID: 0x001175000077592a
            Port 1:
            State: Active
            Physical state: LinkUp
            Rate: 40
            Base lid: 219
            LMC: 0
            SM lid: 1
            Capability mask: 0x07610868
            Port GUID: 0x001175000077592a
            Link layer: IB

            So in your (Whamcloud's) experience there have never been any instances of data loss due to upgrading like this on-the-fly while the filesystem is still online?

            We're using the loadable module variant of the client rather than building a LUSTRE kernel for the clients or using a Whamcloud kernel. Is there anything about the 1.8.8 server that demands 1.8.8 clients? Upgrading all clients = lots of downtime for our users and they've already experienced a LOT of downtime/lost time thanks to LUSTRE.

            ::::::::::::::::::::::::::::::::::::::::::::::::::::::
            Jeffrey T. Frey, Ph.D.
            Systems Programmer IV / Cluster Management
            Network & Systems Services / College of Engineering
            University of Delaware, Newark DE 19716
            Office: (302) 831-6034 Mobile: (302) 419-4976
            http://turin.nss.udel.edu/
            ::::::::::::::::::::::::::::::::::::::::::::::::::::::


            cliffw Cliff White (Inactive) added a comment -

            As far as we know, QLogic cards are supported by the OFED supplied with the RedHat kernel - was there a reason given for using an external OFED?
            Please give us the model numbers of the cards.
            Your upgrade procedure is correct, and identical to that in the Lustre Manual. The change from 1.8.6 to 1.8.8 is a point release; there are absolutely no concerns about compatibility and no need for any special work. The clients may require a RedHat kernel update to match our client kernel.
            There is, as far as I know, not much information beyond the Lustre Manual; the upgrade should not be difficult.
            Please let us know when you are planning this, and we'd be glad to have a conference call or furnish other help.
            Has disabling the read cache produced any change in the performance?
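
            As a side note on mixed client/server versions, a quick way to confirm which Lustre version each node is actually running (a sketch; either form should work on 1.8.x):

            # Query the version through lctl
            [root@oss1 ~]# lctl get_param version
            # Or read it directly from procfs
            [root@oss1 ~]# cat /proc/fs/lustre/version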

            frey@udel.edu Jeffrey Frey added a comment -

            Cliff:

            With respect to upgrading the LUSTRE release on these production servers:

            (1) Since these systems have QLogic's own OFED installed on them, we would want to build the LUSTRE 1.8.8 server components from scratch, correct? It appears this is what Penguin did when the system was built:

            [root@oss1 src]# cd /usr/src
            [root@oss1 src]# ls
            aacraid-1.1.7.28000 debug kernels ofa_kernel ofa_kernel-1.5.3 openib redhat
            [root@oss1 src]# ls kernels/
            2.6.18-238.12.1.el5_lustre.g266a955-x86_64 2.6.18-238.el5-x86_64

            (2) For rolling upgrades, we're assuming the general order of operations would be:

            • upgrade spare MDS
            • failover MDS to upgraded spare, upgrade primary MDS
            • failover oss1 => oss2; upgrade oss1; failover oss2 => oss1; upgrade oss2
            • failover oss3 => oss4; upgrade oss3; failover oss4 => oss3; upgrade oss4

            We'd appreciate citations of any important/helpful materials on the subject of LUSTRE rolling upgrades.

            Thanks for any and all information/feedback you can provide.

            ::::::::::::::::::::::::::::::::::::::::::::::::::::::
            Jeffrey T. Frey, Ph.D.
            Systems Programmer IV / Cluster Management
            Network & Systems Services / College of Engineering
            University of Delaware, Newark DE 19716
            Office: (302) 831-6034 Mobile: (302) 419-4976
            http://turin.nss.udel.edu/
            ::::::::::::::::::::::::::::::::::::::::::::::::::::::
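
            For reference, a rough sketch of the out-of-tree server build implied by the /usr/src listing above: point Lustre's configure at the Lustre-patched kernel headers and at the external (QLogic) OFED tree. The lustre-1.8.8-wc1 source directory name is illustrative, and this assumes the matching lustre-patched kernel-devel tree is already installed; it is an assumption about the build, not a verified recipe:

            [root@oss1 ~]# cd /usr/src/lustre-1.8.8-wc1
            # --with-linux selects the (patched) kernel to build against; --with-o2ib points at the external OFED tree
            [root@oss1 lustre-1.8.8-wc1]# ./configure \
                --with-linux=/usr/src/kernels/2.6.18-238.12.1.el5_lustre.g266a955-x86_64 \
                --with-o2ib=/usr/src/ofa_kernel
            [root@oss1 lustre-1.8.8-wc1]# make rpms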

            millerbe@udel.edu Ben Miller added a comment -

            OK, thanks. I turned off the read cache. We'll let you know if it helps. We'll discuss upgrading internally soon.

            Ben

            cliffw Cliff White (Inactive) added a comment - edited

            Thanks for providing the information so far. Analyzing the results has been useful in understanding what is going on.

            From the IO statistics, we are seeing a great deal of small file IO (4k io size), and very little parallel IO (most of the time only 1 IO in flight). This is not an optimal IO model for Lustre - any steps you can take on the application side to increase IO size, eliminate excessive flush() or sync() calls, or otherwise allow the filesystem to aggregate larger IO will help to improve your performance.

            Given this IO pattern the Lustre read cache – which is on by default - may be doing more harm than good. To turn it off please run the command "lctl set_param obdfilter.*.read_cache_enable=0" on all OSS nodes.

            Finally, we recommend an immediate upgrade to Lustre 1.8.8-wc1 as this release contains optimizations for small file IO (see http://jira.whamcloud.com/browse/LU-983).  The Lustre 1.8.6-wc1 and 1.8.8-wc1 releases are completely compatible, so you can do a rolling upgrade of your systems without needing any downtime.

            Please let me know if you have any further questions about any of the above and let us know whether this advice helps.
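
            As a small follow-up to the read-cache suggestion above, a sketch of applying and then verifying the setting on each OSS (only the set_param command comes from the comment above; the get_param line is simply the corresponding read-back):

            # Disable the OSS read cache on every OST served by this node
            [root@oss1 ~]# lctl set_param obdfilter.*.read_cache_enable=0
            # Confirm it took effect; every obdfilter instance should now report 0
            [root@oss1 ~]# lctl get_param obdfilter.*.read_cache_enable

            Note that set_param changes are not persistent across an OSS reboot, so the setting would need to be reapplied (or made permanent) after a restart.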


            People

              cliffw Cliff White (Inactive)
              adizon Archie Dizon