Lustre / LU-450

System unresponsive and hitting LBUG() after 1.6 => 1.8 upgrade

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • None
    • Lustre 1.8.6
    • 3
    • 6597

      After a site visit yesterday to upgrade from 1.6 to 1.8, the filesystem is now unstable: 'cat /proc/fs/lustre/health_check' on the OSSs takes up to 18 minutes to complete, the system load on the OSSs is 200+, and during the evening we hit several LBUG()s.
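
      To put a number on the slow health_check read described above, a minimal Python sketch such as the following could be run on an OSS. It is only an illustration, not part of the original report: the function name timed_read and the 60-second timeout are assumptions, and a read that is truly stuck in the kernel will leave the background thread blocked until the read returns.

```python
import threading
import time

# Path taken from the report; everything else here is illustrative.
HEALTH_CHECK = "/proc/fs/lustre/health_check"


def timed_read(path, timeout_s=60.0):
    """Read `path` in a background thread.

    Returns (elapsed_seconds, text). `text` is None if the read did not
    finish within `timeout_s` or if the file could not be opened.
    """
    result = {}

    def reader():
        try:
            with open(path) as f:
                result["text"] = f.read().strip()
        except OSError as exc:
            # Missing file, permission problem, etc.
            result["error"] = exc

    worker = threading.Thread(target=reader, daemon=True)
    start = time.monotonic()
    worker.start()
    worker.join(timeout_s)  # give up waiting after timeout_s seconds
    elapsed = time.monotonic() - start
    return elapsed, result.get("text")


if __name__ == "__main__":
    elapsed, text = timed_read(HEALTH_CHECK)
    if text is None:
        print(f"no result after {elapsed:.1f}s (timeout or read error)")
    else:
        # The content is printed as-is; this sketch does not interpret it.
        print(f"read took {elapsed:.1f}s: {text}")
```

      Logging the elapsed time at intervals would show whether the read latency tracks the load on the OSSs or degrades steadily after a restart.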

      Yesterday we upgraded from 1.6.7.2 to 1.8.4.ddn3.1, configured quotas on the system, and fixed an issue with LAST_ID on ost_12 that was causing it to be set inactive at startup.

      It's possible that the upgrade is a red herring. We first had problems with heartbeat restarting the MDS last Thursday. Reading health_check on the MDS started taking too long around 3am last Tuesday morning; at that time I restarted all servers and it was OK again until Friday. It was, however, restarting every half hour over the weekend. We didn't do anything on Monday because of the site shutdown and the upgrade scheduled for Tuesday.

      Also, since the restart the OSTs have been filling up at an alarming rate: they've gone from ~70% up to 100% in some cases. I'm speaking to the customer to see whether this is real data and whether they can stem the tide somehow.

            bobijam Zhenyu Xu
            ihara Shuichi Ihara (Inactive)
            Votes: 0
            Watchers: 4
