Details

    • Bug
    • Resolution: Duplicate
    • Blocker
    • Lustre 2.8.0
    • None
    • lola
      build: tip of master(df6cf859bbb29392064e6ddb701f3357e01b3a13) + patches
    • 3

    Description

      The error occurred during soak testing of build '20151113' (see https://wiki.hpdd.intel.com/pages/viewpage.action?title=Soak+Testing+on+Lola&spaceKey=Releases#SoakTestingonLola-20151113) and had already occurred earlier, when testing build '20151109'.
      DNE is enabled. OSTs were formatted using zfs, MDTs using ldiskfs. The MDS nodes are configured in an HA active-active failover configuration.

      The oom-killer was invoked on four occasions, on the nodes listed below:

      Date             Node     Build ID  Soak event
      Nov 9  18:10:01  lola-9   20151109  no fault; only job execution
      Nov 13 14:30:02  lola-10  20151113  during stopping of soak
      Nov 14 05:35:01  lola-11  20151113  no fault; only job execution
      Nov 14 05:45:01  lola-9   20151113  no fault; only job execution

      All events happened at times when no fault was being injected.
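
      A minimal sketch of how the oom-killer invocations could be located in syslog files such as the attached ones. It assumes the usual kernel message wording "<process> invoked oom-killer"; the function names find_oom_events and scan_log are hypothetical helpers and not part of any Lustre or soak tooling.

```python
import bz2
import gzip
import re

# Assumption: the kernel reports an OOM event with a line containing
# "<process> invoked oom-killer: ...". Only that fragment is matched.
OOM_RE = re.compile(r"(?P<proc>\S+) invoked oom-killer")


def find_oom_events(lines):
    """Return (line_number, process_name, raw_line) for each oom-killer invocation."""
    events = []
    for number, line in enumerate(lines, start=1):
        match = OOM_RE.search(line)
        if match:
            events.append((number, match.group("proc"), line.rstrip("\n")))
    return events


def scan_log(path):
    """Scan a plain, .gz or .bz2 log file and return its oom-killer events."""
    if path.endswith(".bz2"):
        opener = bz2.open
    elif path.endswith(".gz"):
        opener = gzip.open
    else:
        opener = open
    # Text mode with lenient decoding, since console logs may contain stray bytes.
    with opener(path, "rt", errors="replace") as handle:
        return find_oom_events(handle)
```

      Calling scan_log("messages-lola-9.log.bz2") on a downloaded attachment would return one tuple per invocation, which can then be compared against the times listed above.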

      Attached files: console and syslog output of the affected nodes.

      Unfortunately collectl wasn't running to gather performance counters.
      The tool has since been enabled on all soak nodes so that memory statistics, especially slab statistics, can be gathered during one of the next sessions.
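
      As an illustration of the kind of slab data that is of interest here, the following sketch ranks slab caches by approximate size from the contents of /proc/slabinfo. It is a stand-in for illustration only, not what collectl does internally; the helper names are hypothetical, the size estimate (num_objs * objsize) ignores per-slab overhead, and reading /proc/slabinfo usually requires root.

```python
def parse_slabinfo(text):
    """Parse /proc/slabinfo text into a list of per-cache dictionaries."""
    caches = []
    for line in text.splitlines():
        # Skip the version line, the column header and blank lines.
        if not line.strip() or line.startswith("slabinfo") or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        active_objs, num_objs, objsize = (int(value) for value in fields[1:4])
        caches.append({
            "name": fields[0],
            "active_objs": active_objs,
            "num_objs": num_objs,
            "objsize": objsize,
            # Approximation: allocated objects times object size.
            "bytes": num_objs * objsize,
        })
    return caches


def top_caches(text, count=10):
    """Return the `count` largest caches, biggest first."""
    return sorted(parse_slabinfo(text), key=lambda c: c["bytes"], reverse=True)[:count]


def read_slabinfo(path="/proc/slabinfo"):
    """Read the slabinfo file as text."""
    with open(path) as handle:
        return handle.read()
```

      Sampling top_caches(read_slabinfo()) at regular intervals during a soak session would show which caches grow before the oom-killer is triggered.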

      Attachments

        1. console-lola-10.log.gz
          405 kB
          Frank Heckes
        2. console-lola-11.log.gz
          619 kB
          Frank Heckes
        3. console-lola-9.log.gz
          880 kB
          Frank Heckes
        4. messages-lola-10.log.bz2
          790 kB
          Frank Heckes
        5. messages-lola-11.log.bz2
          805 kB
          Frank Heckes
        6. messages-lola-9.log.bz2
          659 kB
          Frank Heckes

        Issue Links

          Activity

            [LU-7432] oom-killer started on MDSes
            di.wang Di Wang made changes -
            Resolution New: Duplicate [ 3 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]
            di.wang Di Wang made changes -
            Link New: This issue is related to LU-7455 [ LU-7455 ]
            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 2.8.0 [ 11113 ]
            heckes Frank Heckes (Inactive) made changes -
            Attachment New: console-lola-9.log.gz [ 19626 ]
            Attachment New: console-lola-10.log.gz [ 19627 ]
            Attachment New: console-lola-11.log.gz [ 19628 ]
            Attachment New: messages-lola-9.log.bz2 [ 19629 ]
            Attachment New: messages-lola-10.log.bz2 [ 19630 ]
            Attachment New: messages-lola-11.log.bz2 [ 19631 ]

            heckes Frank Heckes (Inactive) created issue -

            People

              wc-triage WC Triage
              heckes Frank Heckes (Inactive)
              Votes: 0
              Watchers: 4
