LU-450: System unresponsive and hitting LBUG() after 1.6 => 1.8 upgrade


Details

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • Lustre 1.8.6
    • 3
    • 6597

    Description

      After a site visit yesterday to upgrade from 1.6 to 1.8, the filesystem is now unstable: 'cat /proc/fs/lustre/health_check' on the OSSs takes up to 18 minutes to complete, the system load on the OSSs is 200+, and we hit several LBUG()s during the evening.
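
The slow health_check read described above can be measured directly rather than judged by eye. The following is a minimal sketch, assuming the health file lives at the path quoted in this report; the function name `time_health_check` is hypothetical and not part of any Lustre tooling.

```python
import time


def time_health_check(path="/proc/fs/lustre/health_check"):
    """Read the health file once and return (elapsed_seconds, contents).

    The default path is the one quoted in this report; pass another
    path to time a different file.
    """
    start = time.monotonic()  # monotonic clock is unaffected by wall-clock jumps
    with open(path) as f:
        status = f.read().strip()
    elapsed = time.monotonic() - start
    return elapsed, status
```

Calling `time_health_check()` on an OSS at regular intervals and logging the elapsed time would show whether the read latency tracks the system load or the LBUG() events.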

      Yesterday we upgraded from 1.6.7.2 to 1.8.4.ddn3.1, configured quotas on the system, and fixed an issue with LAST_ID on ost_12 that was causing it to be set inactive at startup.

      It's possible that the upgrade is a red herring. We first had problems with heartbeat restarting the MDS last Thursday. Reading health_check on the MDS started taking too long around 3am last Tuesday morning; at that time I restarted all servers and it was OK again until Friday. However, it was restarting every half hour over the weekend. We didn't do anything on Monday because of the site shutdown and upgrade scheduled for Tuesday.

      Also, since the restart the OSTs have been filling up at an alarming rate: they've gone from ~70% up to 100% in some cases. I'm speaking to the customer to see whether this is real data and whether they can stem the tide somehow.
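
One way to keep an eye on the OST fill levels mentioned above is to scan `lfs df` output for targets above a threshold. This is a rough sketch that assumes the usual column layout (UUID, 1K-blocks, Used, Available, Use%, mount point); the function name `full_osts` and the 90% default are illustrative assumptions, not values from this report.

```python
def full_osts(lfs_df_output, threshold=90):
    """Return [(target_uuid, use_percent)] for OSTs at or above threshold.

    Expects text in the assumed `lfs df` layout:
    UUID  1K-blocks  Used  Available  Use%  Mounted on
    """
    flagged = []
    for line in lfs_df_output.splitlines():
        fields = line.split()
        # Skip the header, MDT rows, summary lines, and short/inactive rows.
        if len(fields) < 6 or "OST" not in fields[0]:
            continue
        use = fields[4]
        if not use.endswith("%"):
            continue
        pct = int(use.rstrip("%"))
        if pct >= threshold:
            flagged.append((fields[0], pct))
    return flagged
```

Feeding it the captured output of `lfs df` from a client at intervals would show which OSTs are filling and how quickly.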

            People

              bobijam Zhenyu Xu
              ihara Shuichi Ihara (Inactive)
              Votes: 0
              Watchers: 4
