  1. Lustre
  2. LU-7780

MDS crashed with oom-killer

Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • None
    • None
    • lola
       build: 2.8.50-6-gf9ca359 ; commit f9ca359284357d145819beb08b316e932f7a3060
    • 3
    • 9223372036854775807

    Description

      The error occurred during soak testing of build '20160215' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20150215). DNE is enabled.
      The MDT was formatted with ldiskfs, the OSTs with zfs. The MDS nodes are configured in an active-active HA failover configuration.

      Please note that build 20160215 is a vanilla build of the master branch.
      This issue might be addressed by the changes included in build '20160210', as we did not observe it during a two-day test session with that build.

      • 2016-02-15 15:37:51,169:fsmgmt.fsmgmt:INFO triggering fault mds_failover (for lola-11)
      • 2016-02-15 15:44:57,839:fsmgmt.fsmgmt:INFO mds_failover just completed (for lola-11)
      • After that, slab memory consumption increased continuously until all resources were exhausted at 2016-02-15 22:38.
      • Most pages are allocated by size-1048576 slabs. The top of the list, sorted by allocation, reads as follows:
        #Date Time SlabName ObjInUse ObjInUseB ObjAll ObjAllB SlabInUse SlabInUseB SlabAll SlabAllB SlabChg SlabPct
        20160215 22:46:20 size-1048576 29147 30562844672 29147 30562844672 29147 30562844672 29147 30562844672 0 0
        20160215 22:46:20 size-262144 1793 470024192 1793 470024192 1793 470024192 1793 470024192 0 0
        20160215 22:46:20 ptlrpc_cache 399364 306711552 399380 306723840 79873 327159808 79876 327172096 180224 0
        20160215 22:46:20 size-1024 229179 234679296 229188 234688512 57295 234680320 57297 234688512 -24576 0
        20160215 22:46:20 size-512 256540 131348480 258232 132214784 32278 132210688 32279 132214784 86016 0
        20160215 22:46:20 size-192 460848 88482816 460880 88488960 23043 94384128 23044 94388224 28672 0
        20160215 22:46:20 size-8192 5776 47316992 5776 47316992 5776 47316992 5776 47316992 -8192 0
        20160215 22:46:20 size-128 265120 33935360 266250 34080000 8875 36352000 8875 36352000 0 0
        20160215 22:46:20 size-65536 361 23658496 361 23658496 361 23658496 361 23658496 0 0
        20160215 22:46:20 kmem_cache 289 9506944 289 9506944 289 18939904 289 18939904 0 0
        

        (see attached file slab-usage-by-allocation-descending.dat)
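
      As a rough cross-check of the figures above, the listing can be parsed and ranked by allocated bytes. This is only a sketch: the function names are illustrative and not part of any Lustre or monitoring tool, the rows are copied from the excerpt above, and the shares are relative to these ten rows only, not to total system memory.

```python
# Rows copied verbatim from the slab listing in this report.
SLAB_LISTING = """\
#Date Time SlabName ObjInUse ObjInUseB ObjAll ObjAllB SlabInUse SlabInUseB SlabAll SlabAllB SlabChg SlabPct
20160215 22:46:20 size-1048576 29147 30562844672 29147 30562844672 29147 30562844672 29147 30562844672 0 0
20160215 22:46:20 size-262144 1793 470024192 1793 470024192 1793 470024192 1793 470024192 0 0
20160215 22:46:20 ptlrpc_cache 399364 306711552 399380 306723840 79873 327159808 79876 327172096 180224 0
20160215 22:46:20 size-1024 229179 234679296 229188 234688512 57295 234680320 57297 234688512 -24576 0
20160215 22:46:20 size-512 256540 131348480 258232 132214784 32278 132210688 32279 132214784 86016 0
20160215 22:46:20 size-192 460848 88482816 460880 88488960 23043 94384128 23044 94388224 28672 0
20160215 22:46:20 size-8192 5776 47316992 5776 47316992 5776 47316992 5776 47316992 -8192 0
20160215 22:46:20 size-128 265120 33935360 266250 34080000 8875 36352000 8875 36352000 0 0
20160215 22:46:20 size-65536 361 23658496 361 23658496 361 23658496 361 23658496 0 0
20160215 22:46:20 kmem_cache 289 9506944 289 9506944 289 18939904 289 18939904 0 0
"""


def parse_slab_listing(text):
    """Turn the whitespace-separated listing into a list of dicts."""
    lines = [line.split() for line in text.strip().splitlines()]
    header = [name.lstrip("#") for name in lines[0]]
    rows = []
    for fields in lines[1:]:
        row = dict(zip(header, fields))
        # Every column after Date, Time and SlabName is an integer.
        for key in header[3:]:
            row[key] = int(row[key])
        rows.append(row)
    return rows


def summarize(rows, column="SlabAllB"):
    """Rank slabs by `column` and return (name, bytes, share of listed total)."""
    total = sum(row[column] for row in rows)
    ranked = sorted(rows, key=lambda row: row[column], reverse=True)
    return [(row["SlabName"], row[column], row[column] / total) for row in ranked]


if __name__ == "__main__":
    for name, nbytes, share in summarize(parse_slab_listing(SLAB_LISTING)):
        print(f"{name:<14} {nbytes / 2**30:8.2f} GiB  {share:6.1%}")
```

      The size-1048576 entry is consistent with its object count: 29147 objects of 1048576 bytes each give 30562844672 bytes.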

      Attached files: messages and console log of lola-11, and the slab usage sorted by allocation at the time the oom-killer was started.

      Attachments

        1. console-lola-11.log.bz2
          87 kB
          Frank Heckes
        2. lola-11-memory-counter-20160215.dat.bz2
          37 kB
          Frank Heckes
        3. messages-lola-11.log.bz2
          72 kB
          Frank Heckes
        4. slab-details.tar.bz2
          831 kB
          Frank Heckes
        5. slab-usage-by-allocation-descending.dat.bz2
          4 kB
          Frank Heckes

            People

              heckes Frank Heckes (Inactive)