Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-7836

MDSes crashed with oom-killer

    XMLWordPrintable

Details

    • 3
    • 9223372036854775807

    Description

      Error occurred during soak testing of build '20160302' (b2_8 RC4) (see: https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160302 also). DNE is enabled. MDTs had been formatted using ldiskfs, OSTs using zfs. MDS nodes are configured in active - active HA failover configuration. (For teset set-up configuration see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-Configuration)

      Note: This might be a duplicate of LU-7780

      Sequence of events:

      • 2016-03-01 20:38:40 triggering fault mds_failover (lola-10 --> lola-11)
      • 2016-03-01 20:41:58 lola-8 LNet process hang on lola-8
        LNet: Service thread pid 5074 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes to /tmp/lustre-log.1456893718.5074
        

        (see attached file lustre-log.1456893718.5074

      • Lustre Recovery never completed (till next failover)
      • All slurm jobs are stalled and began to timeout after ~ 20:40
        Slabs continuously allocated memory. Top 10 consumers before next
        failover are
        slab-details/size-1048576.dat:20160301 22:09:20 size-1048576 13215 13856931840 13215 13856931840 13215 13856931840 13215 13856931840 67108864 0
        slab-details/size-262144.dat:20160301 22:09:20 size-262144 385 100925440 385 100925440 385 100925440 385 100925440 0 0
        slab-details/size-192.dat:20160301 22:09:20 size-192 268155 51485760 268320 51517440 13416 54951936 13416 54951936 24576 0
        slab-details/size-8192.dat:20160301 22:09:20 size-8192 5508 45121536 5508 45121536 5508 45121536 5508 45121536 0 0
        slab-details/size-1024.dat:20160301 22:09:20 size-1024 41857 42861568 41864 42868736 10466 42868736 10466 42868736 110592 0
        slab-details/ptlrpc_cache.dat:20160301 22:09:20 ptlrpc_cache 35590 27333120 35605 27344640 7121 29167616 7121 29167616 122880 0
        slab-details/size-65536.dat:20160301 22:09:20 size-65536 361 23658496 361 23658496 361 23658496 361 23658496 0 0
        slab-details/size-512.dat:20160301 22:09:20 size-512 37639 19271168 38920 19927040 4865 19927040 4865 19927040 45056 0
        slab-details/kmem_cache.dat:20160301 22:09:20 kmem_cache 289 9506944 289 9506944 289 18939904 289 18939904 0 0
        slab-details/size-4096.dat:20160301 22:09:20 size-4096 2970 12165120 2971 12169216 2970 12165120 2971 12169216 4096 0
        slab-details/inode_cache.dat:20160301 22:09:20 inode_cache 15612 9242304 15678 9281376 2613 10702848 2613 10702848 0 0
        
      • 2016-03-01 22:09:46,199:fsmgmt.fsmgmt:INFO triggering fault mds_failover (lola-10 --> lola-11)
      • 2016-03-01 22:16:19,483:fsmgmt.fsmgmt:INFO mds_failover just completed
      • Recovery of MDTs (mdt-4,5) never completed
      • 2016-03-02 00:15-30 Created stack traces and debug log files on lola-[8,10,11]
        *2016-02-02 01:30:01 approximately at this time oom-killer started on MDS lola-10
        Top 10 consumers are:
        slab-details/size-1048576.dat:20160302 01:32:00 size-1048576 29882 31333548032 29882 31333548032 29882 31333548032 29882 31333548032 1048576 0
        slab-details/size-262144.dat:20160302 01:32:00 size-262144 641 168034304 641 168034304 641 168034304 641 168034304 0 0
        slab-details/size-1024.dat:20160302 01:32:00 size-1024 84449 86475776 84484 86511616 21121 86511616 21121 86511616 106496 0
        slab-details/ptlrpc_cache.dat:20160302 01:32:00 ptlrpc_cache 95492 73337856 95510 73351680 19102 78241792 19102 78241792 114688 0
        slab-details/size-192.dat:20160302 01:32:00 size-192 298834 57376128 299260 57457920 14963 61288448 14963 61288448 0 0
        slab-details/size-8192.dat:20160302 01:32:00 size-8192 5862 48021504 5862 48021504 5862 48021504 5862 48021504 -8192 0
        slab-details/size-512.dat:20160302 01:32:00 size-512 77790 39828480 79096 40497152 9879 40464384 9887 40497152 28672 0
        slab-details/size-65536.dat:20160302 01:32:00 size-65536 361 23658496 361 23658496 361 23658496 361 23658496 0 0
        slab-details/kmem_cache.dat:20160302 01:32:00 kmem_cache 289 9506944 289 9506944 289 18939904 289 18939904 0 0
        slab-details/size-128.dat:20160302 01:32:00 size-128 78707 10074496 100920 12917760 3364 13778944 3364 13778944 0 0
        

        *2016-02-03 03:00 MDS lola-8 also crashed with oom-killer

      Attached messages, console and debug logs of nodes lola-8,10,11,
      file recovery-status-20160302 showing recovery and process status around 2016-03-02 00:15, memory and detailed slab,counters of lola-10

      Attachments

        1. console-lola-10.bz2
          363 kB
        2. console-lola-11.bz2
          344 kB
        3. console-lola-11.log-20160608.bz2
          63 kB
        4. console-lola-8.bz2
          523 kB
        5. console-lola-8.log-20160318.bz2
          494 kB
        6. dmesg-lola-11-20160609-0811.bz2
          26 kB
        7. lola-10-lustre-log.20160302-0008.bz2
          4.51 MB
        8. lola-10-memory-counter-2214-0130.dat.bz2
          18 kB
        9. lola-10-slab-detailed-2214-0130.dat.bz2
          595 kB
        10. lola-11-lustre-log.20160302-0008.bz2
          1.33 MB
        11. lola-11-lustre-log.20160608-0656.bz2
          0.3 kB
        12. lola-11-mem-counter-20160311_1729-1910.dat.bz2
          11 kB
        13. lola-11-memory-debug-messages.20160311-183459.bz2
          0.3 kB
        14. lola-8-lustre-log.20160302-0155.bz2
          4.42 MB
        15. lola-8-lustre-log-20160318-0134.bz2
          1.04 MB
        16. lola-8-lustre-log-20160318-0145.bz2
          1.35 MB
        17. lola-8-lustre-log-20160318-0147.bz2
          885 kB
        18. lola-8-lustre-log-20160318-0150.bz2
          938 kB
        19. lola-8-lustre-log-20160318-0153.bz2
          933 kB
        20. lola-8-lustre-log-20160318-0156.bz2
          1.02 MB
        21. lola-8-lustre-log-20160318-0200.bz2
          1012 kB
        22. lola-8-lustre-log-20160318-0204.bz2
          1.00 MB
        23. lola-8-slab-detailed-counter-20160318.tar.bz2
          895 kB
        24. lustre-log.1456893718.5074.bz2
          0.2 kB
        25. messages-lola-10.bz2
          237 kB
        26. messages-lola-11.bz2
          171 kB
        27. messages-lola-11.log-20160608.bz2
          311 kB
        28. messages-lola-8.bz2
          232 kB
        29. messages-lola-8.log-20160318.bz2
          729 kB
        30. recovery-status-20160302
          4 kB
        31. slab-details-2041-2208-one-file-per-slab.tar.bz2
          211 kB
        32. slab-details-2214-0130-one-file-per-slab.tar.bz2
          433 kB
        33. slab-sorted-alloaction-2041-2208.dat.bz2
          4 kB
        34. slab-sorted-alloaction-2214-0130.dat.bz2
          4 kB

        Issue Links

          Activity

            People

              di.wang Di Wang
              heckes Frank Heckes (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: