  1. Lustre
  2. LU-7721

lfsck: slab 'size-1048576' exhaust memory (oom-killer)

Details

    • Bug
    • Resolution: Duplicate
    • Blocker
    • Lustre 2.8.0
    • None
    • lola
      build: master branch, 2.7.65-38-g607f691 ; 607f6919ea67b101796630d4b55649a12ea0e859
    • 3

    Description

      The error happened during soak testing of build '20160126' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160126). DNE is enabled.
      MDTs had been formatted with ldiskfs, OSTs with zfs.
      No faults were injected during the soak test; only application load and lfsck execution were imposed on the test cluster.

      Sequence of events:

      • Jan 27 05:44:56 - Started the lfsck command on the primary MDS (lola-8):
        lctl lfsck_start -M soaked-MDT0000 -s 1000 -t all -A 
        
      • Jan 27 05:49 - OSS node lola-5 hit LBUG (see LU-7720)
      • Jan 27 08:46 - Rebooted lola-5, remounted the OSTs, enabled debug for lfsck, and increased the debug buffer (512 MB).
        The number of blocked ost_* threads kept increasing.
        A huge number of debug messages like the following were printed before the oom-killer started:
        Call Trace:
         [<ffffffff8106cc43>] ? dequeue_entity+0x113/0x2e0
         [<ffffffff8152bd26>] __mutex_lock_slowpath+0x96/0x210
         [<ffffffffa0fcbe7b>] ? ofd_seq_load+0xbb/0xa90 [ofd]
         [<ffffffff8152b84b>] mutex_lock+0x2b/0x50
         [<ffffffffa0fbff18>] ofd_create_hdl+0xc28/0x2640 [ofd]
         [<ffffffffa093a66b>] ? lustre_pack_reply_v2+0x1eb/0x280 [ptlrpc]
         [<ffffffffa093a7a6>] ? lustre_pack_reply_flags+0xa6/0x1e0 [ptlrpc]
         [<ffffffffa093a8f1>] ? lustre_pack_reply+0x11/0x20 [ptlrpc]
         [<ffffffffa09a4f9c>] tgt_request_handle+0x8ec/0x1470 [ptlrpc]
         [<ffffffffa094c201>] ptlrpc_main+0xe41/0x1910 [ptlrpc]
         [<ffffffff8152a39e>] ? thread_return+0x4e/0x7d0
         [<ffffffffa094b3c0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
         [<ffffffff8109e78e>] kthread+0x9e/0xc0
         [<ffffffff8100c28a>] child_rip+0xa/0x20
         [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
         [<ffffffff8100c280>] ? child_rip+0x0/0x20
        
        LustreError: dumping log to /tmp/lustre-log.1453949036.15397
        Pid: 15443, comm: ll_ost00_065
        

        --> attached this debug log file (/tmp/lustre-log.1453949036.15397)

      • Jan 27 18:45 - oom-killer started on OSS node lola-5; the node crashed 3 minutes later
      • Memory was exhausted by slab 'size-1048576', which held ~27 GB
        (see archive: lola-5-oom-killer-2.tar.bz2)
      • Jan 28 03:59 - lfsck command still not finished (see mds-lfsck-status-nslayout.log.bz2, mds-lfsck-status-oi.log.bz2, oss-lfsck-status.log.bz2)
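
The slab figure above can be cross-checked against a saved copy of /proc/slabinfo. The following is a minimal Python sketch, not part of the original report. It assumes the slabinfo version 2.1 text layout and a 4 KiB page size. The function name `slab_bytes` and the sample rows are hypothetical; the object counts were chosen only so that 'size-1048576' comes to 27 GiB, in line with the reported ~27GB, and were not taken from the attached archive.

```python
from typing import Dict

PAGE_SIZE = 4096  # assumption: 4 KiB pages, the usual x86_64 default


def slab_bytes(slabinfo_text: str, page_size: int = PAGE_SIZE) -> Dict[str, int]:
    """Return the approximate number of bytes held by each slab cache.

    Expects text in the /proc/slabinfo version 2.1 layout:
    name active_objs num_objs objsize objperslab pagesperslab
      : tunables ... : slabdata active_slabs num_slabs sharedavail
    """
    usage: Dict[str, int] = {}
    for line in slabinfo_text.splitlines():
        # Skip the version banner, the column header, and blank lines.
        if not line.strip() or line.startswith(("slabinfo", "#")):
            continue
        head, _, rest = line.partition(" : tunables")
        fields = head.split()
        if len(fields) < 6 or "slabdata" not in rest:
            continue  # not a cache row in the expected layout
        pages_per_slab = int(fields[5])
        slabdata = rest.split("slabdata", 1)[1].split()
        num_slabs = int(slabdata[1])
        usage[fields[0]] = num_slabs * pages_per_slab * page_size
    return usage


# Hypothetical sample rows (illustrative values, not from lola-5).
SAMPLE = """\
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
size-1048576       27648  27648 1048576    1  256 : tunables    1    1    0 : slabdata  27648  27648      0
size-4096            512    512    4096    1    1 : tunables   24   12    8 : slabdata    512    512      0
"""

if __name__ == "__main__":
    usage = slab_bytes(SAMPLE)
    top = max(usage, key=usage.get)
    print(f"largest cache: {top}, {usage[top] / 1024**3:.1f} GiB")
```

Sorting the returned dictionary by value gives a quick ranking of caches, which makes a single dominant cache such as 'size-1048576' easy to spot.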

      Attachments

        1. console-lola-5.log.bz2 (237 kB)
        2. lfsck-proc-list.bz2 (0.5 kB)
        3. lola-5-oom-killer-2.tar.bz2 (1.27 MB)
        4. lustre-log.1453949036.15397.bz2 (1 kB)
        5. mds-lfsck-status-nslayout.log.bz2 (2 kB)
        6. mds-lfsck-status-oi.log.bz2 (0.8 kB)
        7. messages-lola-5.log.bz2 (263 kB)
        8. oss-lfsck-status.log.bz2 (1 kB)

            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: heckes Frank Heckes (Inactive)