  Lustre / LU-667

Experiencing sluggish, intermittently unresponsive, and OOM killed MDS nodes


Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Blocker

    Description

      We are experiencing major MDS problems that are greatly affecting the stability of our Lustre filesystem. We have not made any changes to the fundamental configuration or setup of our storage cluster that we can point to as a cause.

      The general symptoms are that the load on the active MDS node is unusually high and filesystem access hangs intermittently. Logged into the active MDS node, we noticed that the command line also hangs intermittently. The ptlrpcd process was pegged at 100%+ CPU usage, followed by roughly 50% CPU usage for the kiblnd_sd_* processes. Furthermore, iowait time is less than 1% while system time ranges from 25% to 80%. It appears that the active MDS is spinning as fast as it can dealing with some kind of RPC traffic coming in over the IB LND, but so far we haven't been able to isolate the traffic involved. As one isolation step we took all the LNet routers feeding in from the compute clusters offline, and the MDS was still churning vigorously in the ptlrpcd and kiblnd processes.

      Another symptom we are seeing now is that when an MDS node becomes active and starts trying to serve clients, we can watch the node rather quickly consume all available memory via Slab allocations and then die an OOM death (a small monitoring sketch follows the list below). Some other observations:

      • OSTs evicting the mdtlov over the IB path
      • fun 'sluggish network' log messages like: ===> Sep 7 12:39:21 amds1 LustreError: 13000:0:(import.c:357:ptlrpc_invalidate_import()) scratch2-OST0053_UUID: RPCs in "Unregistering" phase found (0). Network is sluggish? Waiting them to error out. <===
      • MGS evicting itself over localhost connection: ==> Sep 7 12:39:14 amds1 Lustre: MGS: client 8337bacb-6b62-b0f0-261d-53678e2e56a9 (at 0@lo) evicted, unresponsive for 227s <==
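
      For reference, the Slab growth can be watched directly from /proc. The sketch below is only one illustrative way to do it, assuming the standard /proc/meminfo and /proc/slabinfo formats; the cache ranking, the 10-second interval, and the output layout are arbitrary choices for illustration, not necessarily what we ran (the attached amds2-meminfo and amds2-slabinfo files are plain snapshots).

      #!/usr/bin/env python
      # Illustrative sketch: report total Slab usage and the largest slab caches
      # every 10 seconds, so runaway growth on the MDS is visible over time.
      # Assumes the standard /proc/meminfo and /proc/slabinfo formats; run as
      # root, since /proc/slabinfo is root-readable. Stop with Ctrl-C.
      import time

      def slab_total_kb():
          with open("/proc/meminfo") as f:
              for line in f:
                  if line.startswith("Slab:"):
                      return int(line.split()[1])   # value is reported in kB
          return 0

      def top_slab_caches(count=5):
          caches = []
          with open("/proc/slabinfo") as f:
              for line in f.readlines()[2:]:        # skip the two header lines
                  fields = line.split()
                  name = fields[0]
                  num_objs = int(fields[2])         # <num_objs>
                  objsize = int(fields[3])          # <objsize> in bytes
                  caches.append((num_objs * objsize // 1024, name))
          return sorted(caches, reverse=True)[:count]

      while True:
          print("Slab total: %d kB" % slab_total_kb())
          for kb, name in top_slab_caches():
              print("  %8d kB  %s" % (kb, name))
          time.sleep(10)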

      At this point we have been through three or more MDS failover sequences, and we have also rebooted all of the StorP Lustre servers and restarted the filesystem cleanly to see if that would clean things up.

      We have syslog and Lustre debug message logs from various phases of debugging this. I'm not sure at this point which logs will be most useful, but I'll attach some files after I submit this issue.
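
      For completeness, the Lustre kernel debug buffer can be dumped with lctl along the lines of the sketch below. The extra rpctrace debug flag, the 60-second capture window, and the /tmp/mds-debug.log path are illustrative placeholders, not necessarily the exact procedure we used for the attached logs.

      #!/usr/bin/env python
      # Illustrative sketch: enable RPC tracing, let the MDS churn for a while,
      # then dump the kernel debug buffer to a file that can be attached here.
      # The debug flag, capture window, and output path are placeholders.
      import subprocess, time

      subprocess.check_call(["lctl", "clear"])                        # empty the debug buffer
      subprocess.check_call(["lctl", "set_param", "debug=+rpctrace"]) # add RPC tracing
      time.sleep(60)                                                  # capture window while the MDS is busy
      subprocess.check_call(["lctl", "dk", "/tmp/mds-debug.log"])     # dump the buffer to a file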

      Attachments

        1. amds1.syslog.gz
          38 kB
        2. amds2-meminfo
          0.8 kB
        3. amds2-slabinfo
          19 kB

        Issue Links

          Activity

            People

              Assignee: Cliff White (Inactive) (cliffw)
              Reporter: jbogden
              Votes: 0
              Watchers: 10

              Dates

                Created:
                Updated:
                Resolved: