Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • None
    • None
    • lola
       build: 2.8.50-6-gf9ca359 ; commit f9ca359284357d145819beb08b316e932f7a3060
    • 3

    Description

      The error occurred during soak testing of build '20160215' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20150215). DNE is enabled.
      The MDT was formatted using ldiskfs, the OSTs using zfs. The MDS nodes are configured for active-active HA failover.

      Please note that build 20150215 is a vanilla build of the master branch.
      This issue might be addressed by the changes included in build '20160210', as we did not observe it during a two-day test session.

      • 2016-02-15 15:37:51,169:fsmgmt.fsmgmt:INFO triggering fault mds_failover (for lola-11)
      • 2016-02-15 15:44:57,839:fsmgmt.fsmgmt:INFO mds_failover just completed (for lola-11)
      • After that, slab memory consumption increased continuously until all resources were exhausted at 2016-02-15 22:38.
      • Most pages are allocated by size-1048576 slabs. The largest consumers, in descending order, are:
        #Date Time SlabName ObjInUse ObjInUseB ObjAll ObjAllB SlabInUse SlabInUseB SlabAll SlabAllB SlabChg SlabPct
        20160215 22:46:20 size-1048576 29147 30562844672 29147 30562844672 29147 30562844672 29147 30562844672 0 0
        20160215 22:46:20 size-262144 1793 470024192 1793 470024192 1793 470024192 1793 470024192 0 0
        20160215 22:46:20 ptlrpc_cache 399364 306711552 399380 306723840 79873 327159808 79876 327172096 180224 0
        20160215 22:46:20 size-1024 229179 234679296 229188 234688512 57295 234680320 57297 234688512 -24576 0
        20160215 22:46:20 size-512 256540 131348480 258232 132214784 32278 132210688 32279 132214784 86016 0
        20160215 22:46:20 size-192 460848 88482816 460880 88488960 23043 94384128 23044 94388224 28672 0
        20160215 22:46:20 size-8192 5776 47316992 5776 47316992 5776 47316992 5776 47316992 -8192 0
        20160215 22:46:20 size-128 265120 33935360 266250 34080000 8875 36352000 8875 36352000 0 0
        20160215 22:46:20 size-65536 361 23658496 361 23658496 361 23658496 361 23658496 0 0
        20160215 22:46:20 kmem_cache 289 9506944 289 9506944 289 18939904 289 18939904 0 0
        

        (see attached file slab-usage-by-allocation-descending.dat)
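
To make the numbers above easier to read, here is a minimal Python sketch that parses rows in the format of the listing and converts the largest entry to GiB. The column names are taken from the header line of the listing; the helper names (`parse_slab_rows`, `gib`) are made up for illustration. The average growth rate at the end assumes that the size-1048576 slab usage was negligible when the failover completed, which the report does not state.

```python
from datetime import datetime

# Column names copied from the "#Date Time SlabName ..." header of the listing.
COLUMNS = ["Date", "Time", "SlabName", "ObjInUse", "ObjInUseB", "ObjAll",
           "ObjAllB", "SlabInUse", "SlabInUseB", "SlabAll", "SlabAllB",
           "SlabChg", "SlabPct"]

# First three data rows of the listing, verbatim.
SAMPLE = """\
#Date Time SlabName ObjInUse ObjInUseB ObjAll ObjAllB SlabInUse SlabInUseB SlabAll SlabAllB SlabChg SlabPct
20160215 22:46:20 size-1048576 29147 30562844672 29147 30562844672 29147 30562844672 29147 30562844672 0 0
20160215 22:46:20 size-262144 1793 470024192 1793 470024192 1793 470024192 1793 470024192 0 0
20160215 22:46:20 ptlrpc_cache 399364 306711552 399380 306723840 79873 327159808 79876 327172096 180224 0
"""


def parse_slab_rows(text):
    """Return one dict per data row; comment, blank and malformed lines are skipped."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != len(COLUMNS):
            continue
        row = dict(zip(COLUMNS, fields))
        for name in COLUMNS[3:]:  # everything after SlabName is an integer
            row[name] = int(row[name])
        rows.append(row)
    return rows


def gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / 2**30


rows = parse_slab_rows(SAMPLE)
top = max(rows, key=lambda r: r["SlabAllB"])
print(f"largest slab: {top['SlabName']} = {gib(top['SlabAllB']):.2f} GiB")

# Timestamps from the report: failover completion and the slab snapshot.
failover_done = datetime(2016, 2, 15, 15, 44, 57)
snapshot = datetime.strptime(f"{top['Date']} {top['Time']}", "%Y%m%d %H:%M:%S")
elapsed = snapshot - failover_done

# ASSUMPTION: usage of this slab was close to zero at failover completion,
# so this is only a rough average, not a measured rate.
rate = gib(top["SlabAllB"]) / (elapsed.total_seconds() / 3600)
print(f"elapsed since failover: {elapsed}, average growth ~{rate:.1f} GiB/h")
```

The byte count of the top entry equals 29147 objects of 1048576 bytes each, so practically all of the consumed memory sits in 1 MiB allocations.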

      Attached are the messages and console files of lola-11, and the sorted slab usage at the time the oom-killer was started.

          Activity

            [LU-7780] MDS crashed with oom-killer
            heckes Frank Heckes (Inactive) made changes -
            Assignee Original: Di Wang [ di.wang ] New: Frank Heckes [ heckes ]
            pjones Peter Jones made changes -
            Fix Version/s Original: Lustre 2.9.0 [ 11891 ]
            Resolution New: Duplicate [ 3 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]
            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 2.9.0 [ 11891 ]
            Fix Version/s Original: Lustre 2.8.0 [ 11113 ]
            pjones Peter Jones made changes -
            Priority Original: Blocker [ 1 ] New: Critical [ 2 ]
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-7836 [ LU-7836 ]
            adilger Andreas Dilger made changes -
            Fix Version/s New: Lustre 2.8.0 [ 11113 ]
            adilger Andreas Dilger made changes -
            Assignee Original: WC Triage [ wc-triage ] New: Di Wang [ di.wang ]
            heckes Frank Heckes (Inactive) made changes -
            Attachment New: console-lola-11.log.bz2 [ 20390 ]
            Attachment New: lola-11-memory-counter-20160215.dat.bz2 [ 20391 ]
            Attachment New: messages-lola-11.log.bz2 [ 20392 ]
            Attachment New: slab-details.tar.bz2 [ 20393 ]
            Attachment New: slab-usage-by-allocation-descending.dat.bz2 [ 20394 ]
            heckes Frank Heckes (Inactive) made changes -
            Description: column header line (#Date Time SlabName ...) added to the slab usage listing; text otherwise identical to the Description above.
            heckes Frank Heckes (Inactive) created issue -

            People

              Assignee: heckes Frank Heckes (Inactive)
              Reporter: heckes Frank Heckes (Inactive)
              Votes: 0
              Watchers: 7
