  1. Lustre
  2. LU-7517

oom killer active after failback of MDS resources

Details

    • Bug
    • Resolution: Won't Fix
    • Blocker
    • None
    • None
    • lola:
      build: tip of master + #31 of change 16383
    • 3

    Description

      The error below occurred during soak testing of change 16838, patch set #31 (no wiki entry for this build exists yet), on cluster lola. DNE is enabled and the MDSes are configured in an active-active HA failover configuration.

      The primary MDT resources of lola-11 were failed back on Dec 3 at 20:18.
      Slab allocation then grew continuously, reaching ~31 GB by the time of the crash.
      MDS node lola-11 crashed with the oom-killer active on Dec 4 at 00:21 (local time); see also LU-7432.
      ptlrpc_cache seems to be the biggest consumer.
      Attached are lola-11's messages, console log, vmcore-dmesg file, and collectl (version V4.0.2-1) files for the time interval specified above. Also
      attached are files containing extracted counters for memory, slab totals, and per-slab allocation.

      Attachments

        1. console.log.bz2
          190 kB
        2. console-lola-10.log-20151213.gz
          567 kB
        3. console-lola-11.log.bz2
          120 kB
        4. console-lola-9.log-20151213.gz
          913 kB
        5. lola-10-memory-counter-20151213.dat.bz2
          21 kB
        6. lola-10-one-file-per-slab.tar.bz2
          506 kB
        7. lola-10-slab-detail-counter-20151213.dat.bz2
          721 kB
        8. lola-10-slab-global-counter-20151213.dat.bz2
          25 kB
        9. lola-11-memory-counter-20151213.dat.bz2
          60 kB
        10. lola-11-one-file-per-slab.tar.bz2
          1.21 MB
        11. lola-11-slab-detail-counter-20151213.dat.bz2
          1.97 MB
        12. lola-11-slab-global-counter-20151213.dat.bz2
          69 kB
        13. lola-9-memory-counter-20151213.dat.bz2
          38 kB
        14. lola-9-one-file-per-slab.tar.bz2
          813 kB
        15. lola-9-slab-details-counter-20151213.dat.bz2
          1.20 MB
        16. lola-9-slab-global-counter-20151213.dat.bz2
          44 kB
        17. memory-counter-lola-11.dat.bz2
          25 kB
        18. messages-lola-10.log-20151213.bz2
          774 kB
        19. messages-lola-11.log.bz2
          175 kB
        20. messages-lola-11.log.bz2
          302 kB
        21. messages-lola-9.log-20151213.bz2
          490 kB
        22. slab-details-lola-11.dat.bz2
          873 kB
        23. slab-details-one-file-per-slab.tar.bz2
          617 kB
        24. slab-total-lola-11.dat.bz2
          28 kB
        25. vmcore-dmesg.txt.bz2
          28 kB

        Activity

          [LU-7517] oom killer active after failback of MDS resources

          cliffw Cliff White (Inactive) added a comment - Old issue not reproduced on recent builds
          heckes Frank Heckes (Inactive) added a comment (edited) - The error appeared again on an MDT for build '20151214' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20151214 )
          heckes Frank Heckes (Inactive) added a comment (edited) -

          The error showed up on all soak MDSes (lola-8 not reported in detail) running soak for build https://build.hpdd.intel.com/job/lustre-reviews/36192/

          • lola-9
            • 20151213 09:52:40 failback of MDTs to lola finished
            • Dec 13 16:05:01 lola-9 oom-killer started
          • lola-10
            • 20151213 08:01:40 failback MDTs to lola-10 finished
            • Dec 13 11:40:02 lola-10 oom-killer started
          • lola-11
            • 20151214 02:15:00 failback of MDTs to lola-11 finished
            • Dec 14 12:05:01 lola-11 oom-killer started
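
The interval between failback and the first oom-killer message on each node can be worked out from the timestamps listed above. The following is a minimal sketch only; the year 2015 for the syslog-style entries is an assumption taken from the failback timestamps, and the names `EVENTS` and `elapsed` are illustrative.

```python
from datetime import datetime, timedelta

# (node, failback finished, oom-killer started), copied from the list above.
# The "Dec 13 ..." syslog entries carry no year; 2015 is assumed here.
EVENTS = [
    ("lola-9",  "20151213 09:52:40", "20151213 16:05:01"),
    ("lola-10", "20151213 08:01:40", "20151213 11:40:02"),
    ("lola-11", "20151214 02:15:00", "20151214 12:05:01"),
]

FMT = "%Y%m%d %H:%M:%S"


def elapsed(start: str, end: str) -> timedelta:
    """Return the time between two 'YYYYMMDD HH:MM:SS' stamps."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


for node, failback, oom in EVENTS:
    print(f"{node}: {elapsed(failback, oom)} from failback to oom-killer")
```

With these inputs the intervals range from roughly three and a half hours (lola-10) to just under ten hours (lola-11).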

          From failback until the oom-killer started, the size-1048576 slabs grew continuously; they are the biggest memory consumers:
          lola-9

          #Date Time SlabName ObjInUse ObjInUseB ObjAll ObjAllB SlabInUse SlabInUseB SlabAll SlabAllB SlabChg SlabPct
          slab-details/size-1048576.dat:20151213 16:05:40 size-1048576 23122 24245174272 23122 24245174272 23122 24245174272 23122 24245174272 0 0
          slab-details/size-512.dat:20151213 16:05:40 size-512 11151621 5709629952 11152824 5710245888 1394094 5710209024 1394103 5710245888 7835648 0
          slab-details/size-128.dat:20151213 16:05:40 size-128 7076064 905736192 7077360 905902080 235911 966291456 235912 966295552 1376256 0
          slab-details/size-262144.dat:20151213 16:05:40 size-262144 1673 438566912 1673 438566912 1673 438566912 1673 438566912 0 0
          slab-details/ptlrpc_cache.dat:20151213 16:05:40 ptlrpc_cache 376940 289489920 376940 289489920 75388 308789248 75388 308789248 204800 0
          ...
          

          lola-10

          #Date Time SlabName ObjInUse ObjInUseB ObjAll ObjAllB SlabInUse SlabInUseB SlabAll SlabAllB SlabChg SlabPct
          slab-details/size-1048576.dat:20151213 11:40:40 size-1048576 29494 30926700544 29494 30926700544 29494 30926700544 29494 30926700544 0 0
          slab-details/size-262144.dat:20151213 11:40:40 size-262144 1015 266076160 1015 266076160 1015 266076160 1015 266076160 0 0
          slab-details/ptlrpc_cache.dat:20151213 11:40:40 ptlrpc_cache 195920 150466560 195920 150466560 39184 160497664 39184 160497664 167936 0
          slab-details/size-1024.dat:20151213 11:40:40 size-1024 133537 136741888 133540 136744960 33385 136744960 33385 136744960 73728 0
          slab-details/size-512.dat:20151213 11:40:40 size-512 150577 77095424 153896 78794752 19237 78794752 19237 78794752 0 0
          ...
          

          lola-11

          #Date Time SlabName ObjInUse ObjInUseB ObjAll ObjAllB SlabInUse SlabInUseB SlabAll SlabAllB SlabChg SlabPct
          slab-details/size-1048576.dat:20151214 12:05:40 size-1048576 29392 30819745792 29392 30819745792 29392 30819745792 29392 30819745792 0 0
          slab-details/size-262144.dat:20151214 12:05:40 size-262144 1345 352583680 1345 352583680 1345 352583680 1345 352583680 0 0
          slab-details/ptlrpc_cache.dat:20151214 12:05:40 ptlrpc_cache 224232 172210176 224290 172254720 44858 183738368 44858 183738368 57344 0
          slab-details/size-1024.dat:20151214 12:05:40 size-1024 150612 154226688 150632 154247168 37655 154234880 37658 154247168 -20480 0
          slab-details/size-8192.dat:20151214 12:05:40 size-8192 8859 72572928 8859 72572928 8859 72572928 8859 72572928 0 0
          ...
          

          Attached messages, console log files and extracted collectl counters for memory, slab-global, slab-details for each node.
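
For anyone working with the attached per-slab counter files, the ranking shown in the tables above can be reproduced with a short script. This is a sketch under assumptions: it expects lines in the grep-style layout quoted in this comment (an optional `file:` prefix, then the columns from the `#Date Time SlabName ...` header), and the helper names `parse_line` and `rank_by_allocated` are hypothetical. Three of the lola-9 lines are embedded as sample input.

```python
from typing import NamedTuple

# Counter columns that follow "Date Time SlabName" in the header quoted above.
COLUMNS = ("ObjInUse ObjInUseB ObjAll ObjAllB SlabInUse SlabInUseB "
           "SlabAll SlabAllB SlabChg SlabPct").split()


class SlabSample(NamedTuple):
    date: str
    time: str
    name: str
    counters: dict


def parse_line(line: str) -> SlabSample:
    """Parse one slab-detail line, tolerating a leading 'file:' grep prefix."""
    fields = line.split()
    date = fields[0].rsplit(":", 1)[-1]  # strip "slab-details/<name>.dat:"
    counters = dict(zip(COLUMNS, map(int, fields[3:])))
    return SlabSample(date, fields[1], fields[2], counters)


def rank_by_allocated(lines):
    """Sort samples by total allocated slab bytes (SlabAllB), largest first."""
    samples = [parse_line(l) for l in lines if l.strip()]
    return sorted(samples, key=lambda s: s.counters["SlabAllB"], reverse=True)


# Sample input: three lola-9 lines copied from the table above.
LOLA9 = """\
slab-details/size-512.dat:20151213 16:05:40 size-512 11151621 5709629952 11152824 5710245888 1394094 5710209024 1394103 5710245888 7835648 0
slab-details/size-1048576.dat:20151213 16:05:40 size-1048576 23122 24245174272 23122 24245174272 23122 24245174272 23122 24245174272 0 0
slab-details/ptlrpc_cache.dat:20151213 16:05:40 ptlrpc_cache 376940 289489920 376940 289489920 75388 308789248 75388 308789248 204800 0
"""

ranked = rank_by_allocated(LOLA9.splitlines())
for s in ranked:
    print(f"{s.name:<14} {s.counters['SlabAllB'] / 2**30:8.2f} GiB")
```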

          di.wang Di Wang added a comment -

          It looks like most of the memory is held by the 1M-size slab:

          20151204 00:21:00 size-1048576 29758 31203524608 29758 31203524608 29758 31203524608 29758 31203524608 0 0
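
As a quick sanity check on the line above, the byte total should equal the object count times the 1 MiB object size. A minimal sketch, assuming the `size-1048576` slab name denotes 1048576-byte objects (the variable names are illustrative):

```python
# Values copied from the size-1048576 line at 20151204 00:21:00.
objects_in_use = 29758
reported_bytes = 31203524608

# Assumption: each object in the size-1048576 slab is 1048576 bytes (1 MiB).
object_size = 1048576

computed_bytes = objects_in_use * object_size
gib = computed_bytes / 2**30

print(computed_bytes == reported_bytes)  # the counters are self-consistent
print(f"{gib:.2f} GiB held in size-1048576")
```

The product matches the reported 31203524608 bytes, about 29.06 GiB, in line with the ~31 GB of slab allocation given in the description.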
          
          di.wang Di Wang added a comment -

          looks like lola-8 and lola-9 got OOM as well.


          heckes Frank Heckes (Inactive) added a comment - No debug log files have been written.

          heckes Frank Heckes (Inactive) added a comment -

          The crash dump has been saved to lola-1:/scratch/crashdumps/lu-7517/127.0.0.1-2015-12-04-00\:22\:36.
          It turned out that the collectl raw files are too big to be uploaded to Jira. I saved them to lola-1:/scratch/crashdumps/lu-7517.

          People

            wc-triage WC Triage
            heckes Frank Heckes (Inactive)