[LU-9765] NMI watchdog - OPA <-> IB LNET router Created: 12/Jul/17  Updated: 13/Jul/17

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Cliff White (Inactive) Assignee: Amir Shehata (Inactive)
Resolution: Unresolved Votes: 0
Labels: soak
Environment:

Soak performance cluster


Attachments: File soak-15.console.gz     Text File vmcore-dmesg.txt    
Issue Links:
Related
is related to LU-9769 Exit from function with acquired lock... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We now have two routers, soak-14 and soak-15, routing from IB to OPA clients.
After several hours of running, soak-15 crashed hard, with multiple NMIs on multiple CPUs.
The output is somewhat messy.
A crash dump is available on the node. The vmcore-dmesg and console log are attached.



 Comments   
Comment by Peter Jones [ 12/Jul/17 ]

Amir is investigating

Comment by Dmitry Eremin (Inactive) [ 12/Jul/17 ]

Which version of IFS is installed? This looks very similar to an IFS issue in which schedule() is called with a lock held. That issue is fixed in IFS 10.4.

Comment by Amir Shehata (Inactive) [ 12/Jul/17 ]

How do I find the IFS version installed?

Comment by Dmitry Eremin (Inactive) [ 12/Jul/17 ]

# cat /etc/opa/version_wrapper

Comment by Amir Shehata (Inactive) [ 13/Jul/17 ]

From the core, it appears that the CPT 1 net lock has been unlocked one too many times:

crash> p* the_lnet.ln_net_lock->pcl_locks[0]
$27 = {
  {
    rlock = {
      raw_lock = {
        {
          head_tail = 171313694, 
          tickets = {
            head = 2590, 
            tail = 2614
          }
        }
      }
    }
  }
}
crash> p* the_lnet.ln_net_lock->pcl_locks[1]
$28 = {
  {
    rlock = {
      raw_lock = {
        {
          head_tail = 2492241038, 
          tickets = {
            head = 38030, 
            tail = 38028
          }
        }
      }
    }
  }
}
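
As a cross-check on the two dumps above, the head_tail word can be split by hand. This is a minimal sketch, assuming head is the low 16 bits and tail the high 16 bits of head_tail, which is what the printed values suggest. The function name decode_ticket is made up for illustration and is not part of crash or Lustre.

```shell
#!/bin/sh
# Split a ticket spinlock head_tail word into its two 16-bit halves.
# Assumption: head = low 16 bits, tail = high 16 bits.
# The body runs in a subshell, so the variables stay local to the call.
decode_ticket() (
    head_tail=$1
    head=$(( head_tail & 65535 ))
    tail=$(( (head_tail >> 16) & 65535 ))
    # Signed 16-bit distance from head to tail.
    # A negative value means head has moved past tail.
    delta=$(( (tail - head) & 65535 ))
    if [ "$delta" -ge 32768 ]; then
        delta=$(( delta - 65536 ))
    fi
    echo "head=$head tail=$tail tail-head=$delta"
)

decode_ticket 171313694     # pcl_locks[0]
decode_ticket 2492241038    # pcl_locks[1]
```

For pcl_locks[0] the distance is positive, which would fit a held lock with waiters queued behind it. For pcl_locks[1] the distance is -2, so head has passed tail. If this kernel advances tickets in steps of 2 (an assumption; the step depends on the kernel configuration), that would correspond to a single surplus unlock.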

We currently suspect an issue on the routing path, although a code inspection didn't reveal anything obvious. We're continuing to investigate.

In the meantime, Dmitry installed IFS 10.4 on the routers (soak-14 and soak-15) to avoid running into the HFI bug that leads to a deadlock.

The next time we run the soak tests using the routers, can we turn on net error logging:

lctl set_param debug=+neterror

Then turn on the debug daemon to capture any relevant logs during the test run.
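
The capture steps described above could be wrapped roughly as follows. This is only a sketch: the function names, the dump file paths and the 1024 MB size cap are placeholders chosen for illustration, not values agreed for the soak cluster. Setting LCTL=echo previews the commands without running them.

```shell
#!/bin/sh
# LCTL can be overridden (for example LCTL=echo) to preview the commands.
LCTL=${LCTL:-lctl}

# Enable neterror messages, then start the debug daemon.
# $1: binary dump file (placeholder path)
# $2: size cap in MB (placeholder value)
start_capture() {
    $LCTL set_param debug=+neterror &&
    $LCTL debug_daemon start "$1" "$2"
}

# Stop the debug daemon, then convert the binary dump to text.
# $1: binary dump file, $2: text output file
stop_capture() {
    $LCTL debug_daemon stop &&
    $LCTL debug_file "$1" "$2"
}

# Example use on a router node, as root:
#   start_capture /tmp/lnet-router.bin 1024
#   ... run the soak test ...
#   stop_capture /tmp/lnet-router.bin /tmp/lnet-router.txt
```

The text file produced by the last step is the one to attach to the ticket.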

Comment by Dmitry Eremin (Inactive) [ 13/Jul/17 ]

LU-9769 may be related to this.

Comment by Amir Shehata (Inactive) [ 13/Jul/17 ]

After looking at the core, it appears that all the CPUs are stuck on CPT 0. LU-9769 would have an impact if an older lnetctl were used to delete a net but was given the wrong net_id, so it could potentially be an issue in that scenario. It would be good to run with that patch just in case.
