Lustre / LU-1504

the /lustre filesystem was unusable for an extended period due to a single OST's dropping out of service

Details

    • Type: Task
    • Resolution: Fixed
    • Priority: Major
    • None
    • Affects Version/s: Lustre 2.0.0, Lustre 1.8.6
    • Clustering
    • 4043

    Description

      Hello Support,

      One of our customers at the University of Delaware has had at least three separate instances where the /lustre filesystem was unusable for an extended period because a single OST dropped out of service with:

      Jun 11 02:40:07 oss4 kernel: Lustre: 9443:0:(ldlm_lib.c:874:target_handle_connect()) lustre-OST0016: refuse reconnection from d085b4f1-e418-031f-8474-b980894ce7ad@10.55.50.115@o2ib to 0xffff8103119bac00; still busy with 1 active RPCs

      The hang was so bad for one of them (upwards of 30 minutes with the OST unavailable) that a reboot of the oss1/oss2 pair was necessary. The symptom is easily identified: long hangs on the head node while one waits for a directory listing or for a file to open for editing in vi, etc. Sometimes the situation remedies itself, sometimes it does not and we need to reboot one or more OSS nodes.

      "Enclosed are all of the syslogs, dmesg, and /tmp/lustre* crash dumps for our MDS and OSS's."

      You can retrieve the drop-off anytime in the next 21 days by clicking the following link (or copying and pasting it into your web browser):

      "https://pandora.nss.udel.edu//pickup.php?claimID=vuAFoSBUoReVuaje&claimPasscode=RfTmXJZFVdUGzbLk&emailAddr=tsingh%40penguincomputing.com"

      Full information for the drop-off:

      Claim ID: vuAFoSBUoReVuaje
      Claim Passcode: RfTmXJZFVdUGzbLk
      Date of Drop-Off: 2012-06-11 12:23:20-0400

      Please review the attached log files and advise us on the next course of action, since this is a very critical issue and is impacting their environment. Also, please let me know if you need any further information.

      Thanks
      Terry
      Penguin Tech Support
      Ph: 415-954-2833

      Attachments

        1. headnode-messages.gz
          623 kB
        2. lustre-failure-120619-1.gz
          21 kB
        3. mds0a-messages.gz
          5 kB
        4. oss3-vmstat.log
          9 kB
        5. oss4-messages.gz
          52 kB

        Issue Links

          Activity

            [LU-1504] the /lustre filesystem was unusable for an extended period due to a single OST's dropping out of service

            cliffw Cliff White (Inactive) added a comment -

            Please reopen this issue if you have further comments/questions.
            green Oleg Drokin added a comment -

            I wonder about the "185 of 200 clients have no problem accessing the OST" case you mention. Could it be the case that only 15 "problematic" clients are actually doing any IO on that OST and the rest happen to do IO on other OSTs?
            After all, an overloaded OST is an overloaded OST. Lustre cannot really improve disk controller properties, and if the controller hates lots of small IO (and frankly, a lot of them do), there is very little we can do about it.

            People in big government labs have these problems too, and the solutions range from changing hardware, to limiting "bad-behaving" applications to only certain parts of the machine, to actually rewriting the apps to form a better IO pattern.

            The 100k lock count should not be affecting anything as long as you don't run out of memory on the OSS nodes, but looking at the vmstat output it's evident you don't have this problem.

            brw_stats output shows a lot of 4k IO, and it only tracks data IO by applications.
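
            One way to check which clients are actually issuing IO against a given OST is to look at the per-export statistics on the OSS. A minimal sketch, assuming the 1.8.x per-export stats files are present under /proc/fs/lustre/obdfilter/<OST>/exports/ (directory naming and stats layout can differ between releases; the OST name is illustrative):

            # List per-client read/write byte counters for one OST on this OSS.
            OST=lustre-OST0016
            for exp in /proc/fs/lustre/obdfilter/$OST/exports/*/stats; do
                client=$(basename "$(dirname "$exp")")
                io=$(grep -E '^(read|write)_bytes' "$exp" 2>/dev/null)
                # only print clients that have actually done data IO
                [ -n "$io" ] && printf '%s\n%s\n\n' "$client" "$io"
            done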

            cliffw Cliff White (Inactive) added a comment - - edited

            The random 4k IO was identified in the brw_stats which you furnished to us; to quote that comment (18/Jun/12 11:18 AM):
            "2. On each server for each OST, there is a 'brw_stats' file located on the path
            /proc/fs/lustre/obdfilter/<OST ID>/brw_stats where 'OST ID' is of the format lustre-OST00XX. for each of your OSTs, from a login on the OSS please issue the command:
            '# cat /proc/fs/lustre/obdfilter/<OST ID>/brw_stats > <file>'
            where <file> is the OST ID. Please do this when the server load is high. "

            You can examine these files using that method at any time to see how your IO is performing, and can clear the data by echoing 'clear' into the file.
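
            For example, a minimal shell sketch of the collection step quoted above: snapshot brw_stats for every OST on an OSS into per-OST files, and optionally reset the counters afterwards (the output directory name is illustrative):

            # Capture brw_stats for all OSTs on this OSS while load is high.
            outdir=/tmp/brw_stats.$(date +%Y%m%d-%H%M)
            mkdir -p "$outdir"
            for f in /proc/fs/lustre/obdfilter/lustre-OST*/brw_stats; do
                ost=$(basename "$(dirname "$f")")      # e.g. lustre-OST0012
                cat "$f" > "$outdir/$ost"
            done
            # To start a fresh measurement window, clear the counters:
            # for f in /proc/fs/lustre/obdfilter/lustre-OST*/brw_stats; do echo clear > "$f"; done
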
            In the data you previously furnished to us, we see this, as an example:

            # cat lustre-OST0012
            snapshot_time:         1340104119.496271 (secs.usecs)
            
                                       read      |     write
            pages per bulk r/w     rpcs  % cum % |  rpcs  % cum %
            1:                  491976  73  73   | 408911  59  59
            2:                    4588   0  74   | 16816   2  62
            4:                    1431   0  74   | 10062   1  63
            8:                    1156   0  74   | 19221   2  66
            16:                   1077   0  74   | 7433   1  67
            32:                   1731   0  74   | 13036   1  69
            64:                   2898   0  75   | 8544   1  70
            128:                  5149   0  76   | 12009   1  72
            256:                160523  23 100   | 187956  27 100
            

            A 'page' in this case is a 4k memory page. So we can see here that 73% of your read IO and 59% of your write IO was 4k in size. This is rather more than we would expect from the odd ls, etc., and likely indicates application behavior. Again, this is why we advise the 1.8.8 upgrade. The 1.8.8 release has been out for quite some time and is stable in numerous production deployments.
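
            To pull that single-page share out of a captured file without reading the whole table, a minimal awk sketch (column positions follow the lustre-OST0012 sample above; later sections of brw_stats are ignored):

            # Print the percentage of 1-page (4k) read and write RPCs.
            awk '
                /pages per bulk/ { in_tbl = 1; next }   # header of the first table
                in_tbl && $1 == "1:" {                  # 1-page bucket = 4k RPCs
                    printf "4k reads: %s%%  4k writes: %s%%\n", $3, $7
                    exit
                }
            ' lustre-OST0012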

            frey@udel.edu Jeffrey Frey added a comment -

            Thanks, Cliff. We cannot "jump" into the 1.8.8 upgrade for two important reasons:

            (1) We like to be very aware of what's changed in a software package before we install it on a production system – so we have some reading ahead of us to get familiar with the 1.8.6 to 1.8.8 roadmap

            (2) This is a production system with users' computation work running out of the Lustre filesystem; any disruption is a major issue for our users, so we try to minimize those

            In theory a rolling upgrade works flawlessly; but in theory the hardware we were sold behind this Lustre filesystem is enterprise-class, fast storage, too :-\ So our users have to be notified of the possibility of failure, which means that if there could be downtime involved we may as well do other patching at the same time. It just so happens that QLogic released a firmware update for our switch recently. We plan to do the 1.8.8 upgrade within the next month; we just have some planning and scheduling to do first.

            As for the random 4k I/O, for our sake can you show the applicable portions of the logs/debug traces that led to that diagnosis? In context, is it all block-level I/O to/from files, or could it be standard filesystem ops (ls, cp, tar, etc.), e.g. also fetching size metadata from the OST(s)?

            ::::::::::::::::::::::::::::::::::::::::::::::::::::::
            Jeffrey T. Frey, Ph.D.
            Systems Programmer IV / Cluster Management
            Network & Systems Services / College of Engineering
            University of Delaware, Newark DE 19716
            Office: (302) 831-6034 Mobile: (302) 419-4976
            http://turin.nss.udel.edu/
            ::::::::::::::::::::::::::::::::::::::::::::::::::::::


            cliffw Cliff White (Inactive) added a comment -

            We have already recommended that you upgrade to 1.8.8 to improve small-file IO. We understand that you run a mix of workloads; we believe that if you could identify which workload is doing the random 4k IO, changes to that workload would help the situation.
            frey@udel.edu Jeffrey Frey added a comment -

            Interesting that this is a 2.x problem; the fact remains, though, that the symptoms we observe on the client side are EXACTLY the same as the ticket I mentioned, so we were hoping that would provide you folks with a better idea of what's going on. The problem seems like it has to be partly client-oriented, since 185 of the 200 Lustre clients were using the OST without issue while 15 of the clients showed it as unavailable. On the cluster head node where we noticed this recent outage:

            [root@mills lustre-OST0006-osc-ffff880d13d29000]# pwd
            /proc/fs/lustre/ldlm/namespaces/lustre-OST0006-osc-ffff880d13d29000

            [root@mills lustre-OST0006-osc-ffff880d13d29000]# more lock_*
            ::::::::::::::
            lock_count
            ::::::::::::::
            99554
            ::::::::::::::
            lock_unused_count
            ::::::::::::::
            0
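
            To see whether those ~100k locks are concentrated on a single OSC namespace, the same counters can be surveyed for every OST from the client. A minimal sketch using the paths shown above (the ffff... suffix varies per mount):

            # Print DLM lock counts for each OST as seen from this client.
            for ns in /proc/fs/lustre/ldlm/namespaces/lustre-OST*-osc-*; do
                printf '%-45s locks=%-8s unused=%s\n' "$(basename "$ns")" \
                    "$(cat "$ns/lock_count")" "$(cat "$ns/lock_unused_count")"
            done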

            Should we read this statement as "use Lustre for applications with a couple of large files to access"? There is no single application here, since this is a community cluster with many different applications, and Lustre was intended by us to be used as working space for users' jobs on the cluster (à la scratch).

            On Jul 12, 2012, at 12:51 PM, Oleg Drokin (JIRA) wrote:

            ::::::::::::::::::::::::::::::::::::::::::::::::::::::
            Jeffrey T. Frey, Ph.D.
            Systems Programmer IV / Cluster Management
            Network & Systems Services / College of Engineering
            University of Delaware, Newark DE 19716
            Office: (302) 831-6034 Mobile: (302) 419-4976
            http://turin.nss.udel.edu/
            ::::::::::::::::::::::::::::::::::::::::::::::::::::::

            green Oleg Drokin added a comment -

            I believe your problem does not look like LU-416, which was a client-side problem (and one specific to 2.x releases only, at that).
            You are clearly having a problem on the server side.
            It looks like your disk devices are overloaded with small IO, and there is very little Lustre can do about it, because apparently this is what the application in question wants.
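
            One way to confirm that the backing devices behind a busy OST are saturated is to watch extended iostat output on the OSS while the hang is happening. A minimal sketch (iostat comes from the standard sysstat package):

            # Sample extended device stats every 5 seconds; watch the rows for
            # the OST backing devices.
            iostat -xk 5
            # Sustained %util near 100, high await, and a small avgrq-sz are
            # consistent with the random-4k pattern seen in brw_stats.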


            frey@udel.edu Jeffrey Frey added a comment -

            Cliff:

            Further research on your Jira site has shown that the issue we're seeing is EXACTLY the situation reported in LU-416:

            http://jira.whamcloud.com/browse/LU-416

            To what version of Lustre does that incident correspond – 1.8.6? – and is it resolved in 1.8.8?

            ::::::::::::::::::::::::::::::::::::::::::::::::::::::
            Jeffrey T. Frey, Ph.D.
            Systems Programmer IV / Cluster Management
            Network & Systems Services / College of Engineering
            University of Delaware, Newark DE 19716
            Office: (302) 831-6034 Mobile: (302) 419-4976
            http://turin.nss.udel.edu/
            ::::::::::::::::::::::::::::::::::::::::::::::::::::::

            millerbe@udel.edu Ben Miller added a comment -

            We continue to have temporary OST unavailability with the read cache
            disabled. We haven't had a chance to upgrade to 1.8.8 yet. Sometimes
            the OST will recover within a few minutes; other times, after several
            minutes (or hours), we end up rebooting an OSS or two to get Lustre
            available again.

            Ben


            cliffw Cliff White (Inactive) added a comment -

            Please update us as to your status. What else can we do to help on this issue? Should we close this bug?

            People

              Assignee: cliffw Cliff White (Inactive)
              Reporter: adizon Archie Dizon
              Votes: 0
              Watchers: 5

              Dates

                Created:
                Updated:
                Resolved: