Lustre / LU-9765

NMI watchdog - OPA <-> IB LNET router

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.0
    • Environment: Soak performance cluster
    • Severity: 3

    Description

      We now have two routers from IB to the OPA clients, soak-14/15.
      After several hours of running, soak-15 crashed hard, with multiple NMIs on multiple CPUs.
      The output is somewhat messy.
      A crash dump is available on the node; vmcore-dmesg and the console log are attached.

      Attachments

        1. soak-15.console.gz
          230 kB
          Cliff White
        2. vmcore-dmesg.txt
          993 kB
          Cliff White


          Activity


            After looking at the core, it appears that all the CPUs are stuck on CPT 0. LU-9769 would have an impact if an older lnetctl was used to delete a net but provided the wrong net_id, so it could potentially be an issue in that scenario. It would be good to run with that patch just in case.

            ashehata Amir Shehata (Inactive) added a comment

            Maybe LU-9769 is related to this.

            dmiter Dmitry Eremin (Inactive) added a comment

            From the core, it appears that the CPT 1 net lock has been unlocked one too many times:

            crash> p* the_lnet.ln_net_lock->pcl_locks[0]
            $27 = {
              {
                rlock = {
                  raw_lock = {
                    {
                      head_tail = 171313694, 
                      tickets = {
                        head = 2590, 
                        tail = 2614
                      }
                    }
                  }
                }
              }
            }
            crash> p* the_lnet.ln_net_lock->pcl_locks[1]
            $28 = {
              {
                rlock = {
                  raw_lock = {
                    {
                      head_tail = 2492241038, 
                      tickets = {
                        head = 38030, 
                        tail = 38028
                      }
                    }
                  }
                }
              }
            }
            
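            The head_tail values above can be decoded by hand: on x86, the 32-bit ticket-spinlock word packs head (the ticket currently being served) into the low 16 bits and tail (the next ticket to hand out) into the high 16 bits, so head ahead of tail means the lock was released more times than it was taken. A quick sketch of that decoding (the helper name is mine, not from the crash tooling):

            ```python
            def decode_ticket_lock(head_tail):
                """Split a 32-bit x86 ticket-lock word into (head, tail).

                head = low 16 bits (ticket currently being served),
                tail = high 16 bits (next ticket to be handed out).
                """
                return head_tail & 0xFFFF, (head_tail >> 16) & 0xFFFF

            # Values taken from the crash output above.
            for name, word in (("pcl_locks[0]", 171313694),
                               ("pcl_locks[1]", 2492241038)):
                head, tail = decode_ticket_lock(word)
                if head > tail:  # ignoring 16-bit wraparound for this quick check
                    state = "head ahead of tail: unbalanced unlock"
                elif head == tail:
                    state = "free"
                else:
                    state = "%d waiter(s) queued" % (tail - head)
                print("%s: head=%d tail=%d (%s)" % (name, head, tail, state))
            ```

            This reproduces the analysis above: pcl_locks[0] (CPT 0) shows 24 waiters queued behind the lock, matching "all the CPUs are stuck on CPT 0", while pcl_locks[1] (CPT 1) has head past tail, i.e. an extra unlock.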

            We're currently suspecting an issue on the routing path, although a code inspection didn't reveal anything obvious. We're continuing to investigate.

            In the meantime, Dmitry installed IFS 10.4 on the routers (soak-14 and soak-15) to avoid running into the HFI bug which leads to a deadlock.

            Next time we run the soak tests using the routers, can we turn on net error logging:

            lctl set_param debug=+neterror
            

            Then turn on the debug daemon to capture any relevant logs during the test run.
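
            For reference, the full capture sequence might look like the following sketch (the file paths and size argument are illustrative; debug_daemon and debug_file are standard lctl subcommands):

            ```shell
            # On each router (soak-14/soak-15): add net error messages to the debug mask
            lctl set_param debug=+neterror

            # Start the debug daemon, dumping the trace buffer to a file (up to 1024 MB)
            lctl debug_daemon start /tmp/lnet-debug 1024

            # ... run the soak test ...

            # Stop the daemon and convert the binary dump to readable text
            lctl debug_daemon stop
            lctl debug_file /tmp/lnet-debug /tmp/lnet-debug.txt
            ```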

            ashehata Amir Shehata (Inactive) added a comment
            # cat /etc/opa/version_wrapper
            
            dmiter Dmitry Eremin (Inactive) added a comment

            How do I find the IFS version installed?

            ashehata Amir Shehata (Inactive) added a comment

            What version of IFS is installed? It looks very similar to a known IFS issue where schedule() is called with a lock held. That is fixed in IFS 10.4.

            dmiter Dmitry Eremin (Inactive) added a comment
            pjones Peter Jones added a comment -

            Amir is investigating


            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 7
