LU-8250

MDT recovery stalled on secondary node

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.9.0
    • Lustre 2.9.0
    • lola
      build: commit aa84fbc8165f526dae4bd824a48c186c3ac2f639 + patches
    • 3

    Description

      The error happened during soak testing of build '20160601' (see: https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160601). DNE is enabled. MDTs were formatted using ldiskfs, OSTs using zfs. Each MDS hosts one MDT. The MDSes are configured in an active-active failover configuration.

      The issue might be related to LU-7848, although that change is already part of the build under test.
      The error ends with the oom-killer being invoked, which is documented in LU-7836. This ticket might be a duplicate of LU-7836.

      Events:
      1st Event:

      • 2016-06-03 11:31:10 - failover resource of lola-10 (MDT-2) --> lola-11
      • 2016-06-03 11:36:37 - ... soaked-MDT0002 mounted successfully on lola-11
      • until 2016-06-04 00:44 - soaked-MDT0002 remained in status 'RECOVERING'
      • 2016-06-04 00:44:52 - lola-11 crashed with oom-killer
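
      To put a number on how long recovery was stalled in the 1st event, the following is a minimal sketch that subtracts the mount timestamp from the crash timestamp listed above. The helper name `elapsed` is hypothetical and only used for illustration; both timestamps are assumed to be in the same time zone.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def elapsed(start: str, end: str) -> tuple[int, int, int]:
    """Return (hours, minutes, seconds) between two timestamps in FMT."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    hours, rem = divmod(int(delta.total_seconds()), 3600)
    minutes, seconds = divmod(rem, 60)
    return hours, minutes, seconds


# 1st event: soaked-MDT0002 mounted on lola-11 -> lola-11 crash with oom-killer
h, m, s = elapsed("2016-06-03 11:36:37", "2016-06-04 00:44:52")
print(f"{h}h {m}m {s}s")  # prints "13h 8m 15s"
```

      By this calculation the MDT spent a little over 13 hours in 'RECOVERING' before the node ran out of memory.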

      2nd Event:

      • 2016-06-07 08:34:06,621 triggering fault mds_failover lola-10 (MDT-2) --> lola-11
      • 2016-06-07 08:38:42 - Mounting soaked-MDT0002 on lola-11
      • since 2016-06-07 08:39:32,155 Wait for recovery to complete
      • memory resources are nearly exhausted:
        [root@lola-11 ~]# date
        Wed Jun  8 07:59:49 PDT 2016
        [root@lola-11 ~]# collectl -sm --verbose
        waiting for 1 second sample...
        
        # MEMORY SUMMARY
        #<-------------------------------Physical Memory--------------------------------------><-----------Swap------------><-------Paging------>
        #   Total    Used    Free    Buff  Cached    Slab  Mapped    Anon  Commit  Locked Inact Total  Used  Free   In  Out Fault MajFt   In  Out
           32006M  30564M   1441M 127144K 676256K  28701M  16196K  69072K 201740K   5008K  509M   15G     0   15G    0    0    28     0    0    8
           32006M  30565M   1441M 127144K 676256K  28701M  16196K  69072K 201740K   5008K  509M   15G     0   15G    0    0    63     0    0    4
           32006M  30565M   1441M 127144K 676256K  28701M  16196K  69072K 201740K   5008K  509M   15G     0   15G    0    0     1     0    0    0
           32006M  30564M   1441M 127144K 676256K  28701M  16196K  69072K 201740K   5008K  509M   15G     0   15G    0    0    17     0    0    0
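
        The collectl sample above can be reduced to a single figure: the share of physical memory held in slab. This is a rough sketch only; the helper `to_mib` is hypothetical, and it assumes collectl's K/M/G suffixes are powers of 1024. The input string is the first six columns (Total, Used, Free, Buff, Cached, Slab) of the first sample line.

```python
def to_mib(token: str) -> float:
    """Convert a size token such as '32006M', '676256K' or '15G' to MiB.

    Assumption: suffixes are binary multiples (K = 1024 bytes, and so on).
    """
    units = {"K": 1 / 1024, "M": 1.0, "G": 1024.0}
    return float(token[:-1]) * units[token[-1]]


# Total, Used, Free, Buff, Cached, Slab from the first sample line
sample = "32006M  30564M   1441M 127144K 676256K  28701M"
total, used, free, buff, cached, slab = (to_mib(t) for t in sample.split())

slab_share = slab / total
print(f"slab: {slab:.0f} MiB of {total:.0f} MiB ({slab_share:.1%})")
```

        With these figures slab accounts for roughly 90% of physical memory, while the Cached and Anon columns are comparatively small, so the sample points to kernel slab allocations rather than page cache or user processes as the consumer.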
        

      Attached files:

      • 1st event only: crash dump file saved to lhn.hpdd.intel.com:/var/crashdumps/lu-7836/lola-11/127.0.0.1-2016-06-04-00:44:52
      • 2nd event only: kernel debug log of lola-11, dmesg
      • Both events: messages, console logs

            People

              tappro Mikhail Pershin
              heckes Frank Heckes (Inactive)