Details

    • Bug
    • Resolution: Duplicate
    • Blocker
    • Lustre 2.8.0
    • None
    • lola
      build: tip of master(df6cf859bbb29392064e6ddb701f3357e01b3a13) + patches
    • 3
    • 9223372036854775807

    Description

      The error occurred during soak testing of build '20151113' (see https://wiki.hpdd.intel.com/pages/viewpage.action?title=Soak+Testing+on+Lola&spaceKey=Releases#SoakTestingonLola-20151113) and had already been seen earlier while testing build '20151109'.
      DNE is enabled. OSTs had been formatted using zfs, MDTs using ldiskfs. MDS nodes are configured in an active-active HA failover configuration.

      At the following times:

      date             node     build ID  soak event
      Nov 9  18:10:01  lola-9   20151109  no fault; only job execution
      Nov 13 14:30:02  lola-10  20151113  during stopping of soak
      Nov 14 05:35:01  lola-11  20151113  no fault; only job execution
      Nov 14 05:45:01  lola-9   20151113  no fault; only job execution

      the oom-killer was invoked on the nodes listed. (All events happened at times when no fault was being injected.)

      Attached files: console and syslog of nodes affected.

      Unfortunately, collectl wasn't running to gather performance counters.
      The tool has now been enabled on all soak nodes so that memory statistics, especially slab statistics, can be captured during one of the next sessions.

      Attachments

        1. console-lola-10.log.gz
          405 kB
        2. console-lola-11.log.gz
          619 kB
        3. console-lola-9.log.gz
          880 kB
        4. messages-lola-10.log.bz2
          790 kB
        5. messages-lola-11.log.bz2
          805 kB
        6. messages-lola-9.log.bz2
          659 kB

          Activity

            [LU-7432] oom-killer started on MDSes
            di.wang Di Wang added a comment -

            duplicate with LU-7039

            di.wang Di Wang added a comment -

            I checked the log and was also monitoring the MDS when the OOM was about to happen. It seems to be caused by endless recovery on some MDTs, i.e. once the recovery abort problem is fixed, this problem should go away. Since the endless recovery will be fixed by the patch in LU-7039 and other related patches under LU-7455, I will close this ticket.

            Frank, if you see something different, please re-open this one. thanks.


            heckes Frank Heckes (Inactive) added a comment -

            Sorry, I overlooked Oleg's request.
            For the incident described above we don't have these counters. I enabled collectl on all soak nodes
            last week to gather performance counters, especially the slab details. We should be prepared to
            replay the stats in case the incident happens again.
            pjones Peter Jones added a comment -

            Frank

            When do you think that you will be able to provide the info that Oleg has requested?

            Thanks

            Peter

            green Oleg Drokin added a comment -

            we need /proc/slabinfo output here to see who's using the ram.

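For anyone collecting the requested data, the following is a minimal sketch of how a saved copy of /proc/slabinfo could be ranked by approximate memory use. It assumes the slabinfo 2.x text layout (cache name, active_objs, num_objs, objsize, ...) and estimates each cache as num_objs * objsize, which ignores per-slab overhead. The function names parse_slabinfo, top_caches and report are hypothetical helpers for illustration and are not part of Lustre or collectl.

```python
def parse_slabinfo(text):
    """Return a list of (cache_name, approx_bytes) from slabinfo 2.x text."""
    caches = []
    for line in text.splitlines():
        # Skip blank lines, the version banner and the column-header comment.
        if not line.strip() or line.startswith(("slabinfo", "#")):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        name = fields[0]
        num_objs, objsize = int(fields[2]), int(fields[3])
        # Rough estimate only: allocated objects times object size.
        caches.append((name, num_objs * objsize))
    return caches


def top_caches(text, n=10):
    """Return the n largest caches, biggest first."""
    return sorted(parse_slabinfo(text), key=lambda c: c[1], reverse=True)[:n]


def report(path="/proc/slabinfo", n=10):
    """Print the n largest caches found in the file at path."""
    with open(path) as f:
        for name, size in top_caches(f.read(), n):
            print(f"{name:<30} {size / 2**20:10.1f} MiB")
```

Reading /proc/slabinfo directly usually requires root, so calling report() on a copy saved from the affected MDS (for example, report("slabinfo-lola-9.txt"), a hypothetical file name) may be more convenient.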

            People

              wc-triage WC Triage
              heckes Frank Heckes (Inactive)