  1. Lustre
  2. LU-1543

Lustre Servers - MDS / OSS Died & fail over took over

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.2.0
    • Labels: None
    • Severity: 3
    • Rank: 6378

    Description

      Dear Support,
      Over the past weeks we have had problems with the MDS/OSS Lustre servers. Last week one of the OSSs died and, fortunately, the failover server took over. Last night both the MDS (see log weissh01.log) and another OSS (see log weiss11.log) died; in that case too, the failover servers took over.
      Yesterday's two failures look related: both servers died at essentially the same time, around 19:20-19:30.

      As I said, the file system remained up and running because the failover servers took over. However, with our old Lustre configuration (version 1.8.7: 1 MDS + 1 failover MDS, 4 OSSs), the MDS and OSSs never died, even under huge stress and with many slowdown messages logged due to heavy I/O load.

      If you need access to our cluster, please let me know (fverzell@cscs.ch) so that we can arrange an account for you.

      We also currently have several open tickets that might be related to each other in some respects:

      http://jira.whamcloud.com/browse/LU-1447
      http://jira.whamcloud.com/browse/LU-1455
      http://jira.whamcloud.com/browse/LU-1470
      http://jira.whamcloud.com/browse/LU-1503

      Regards
      Fabio

      Attachments

        1. 19_jun.log
          1.81 MB
        2. 20_jun.log
          918 kB
        3. weiss01.log
          181 kB
        4. weiss11.log
          57 kB

          People

            Assignee: cliffw Cliff White (Inactive)
            Reporter: fverzell Fabio Verzelloni