[LU-9765] NMI watchdog - OPA <-> IB LNET router Created: 12/Jul/17 Updated: 13/Jul/17 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Cliff White (Inactive) | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | soak | ||
| Environment: |
Soak performance cluster |
||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
We now have two LNet routers, soak-14 and soak-15, routing between IB and the OPA clients. |
| Comments |
| Comment by Peter Jones [ 12/Jul/17 ] |
|
Amir is investigating |
| Comment by Dmitry Eremin (Inactive) [ 12/Jul/17 ] |
|
Which version of IFS is installed? This looks very similar to an IFS issue in which schedule() is called with a lock held. That issue is fixed in IFS 10.4. |
| Comment by Amir Shehata (Inactive) [ 12/Jul/17 ] |
|
How do I find the IFS version installed? |
| Comment by Dmitry Eremin (Inactive) [ 12/Jul/17 ] |
# cat /etc/opa/version_wrapper |
| Comment by Amir Shehata (Inactive) [ 13/Jul/17 ] |
|
From the core it appears that the CPT 1 net lock has been unlocked one too many times:
crash> p* the_lnet.ln_net_lock->pcl_locks[0]
$27 = {
  {
    rlock = {
      raw_lock = {
        {
          head_tail = 171313694,
          tickets = {
            head = 2590,
            tail = 2614
          }
        }
      }
    }
  }
}
crash> p* the_lnet.ln_net_lock->pcl_locks[1]
$28 = {
  {
    rlock = {
      raw_lock = {
        {
          head_tail = 2492241038,
          tickets = {
            head = 38030,
            tail = 38028
          }
        }
      }
    }
  }
}
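The imbalance can be read straight from the raw values. The following is a minimal Python sketch, not part of the original analysis. It assumes the x86 ticket spinlock layout in which `head_tail` packs `head` into the low 16 bits and `tail` into the high 16 bits (both dumps above are consistent with that), and it assumes a ticket increment of 2, as used by kernels built with paravirtualized spinlocks. `decode_ticket_lock` and `TICKET_INC` are hypothetical names chosen for illustration.

```python
# Assumption: each lock/unlock advances a ticket by 2 (paravirt spinlock
# builds); kernels without that option step by 1.
TICKET_INC = 2


def decode_ticket_lock(head_tail):
    """Split a raw 32-bit head_tail value and report the lock balance."""
    head = head_tail & 0xFFFF           # ticket currently being served
    tail = (head_tail >> 16) & 0xFFFF   # next ticket to hand out
    # Signed 16-bit distance from head to tail, safe across wraparound.
    delta = (tail - head) & 0xFFFF
    if delta >= 0x8000:
        delta -= 0x10000
    return {
        "head": head,
        "tail": tail,
        # tail ahead of head: holder plus waiters still outstanding
        "outstanding": delta // TICKET_INC if delta >= 0 else 0,
        # head ahead of tail: more unlocks than locks
        "extra_unlocks": -delta // TICKET_INC if delta < 0 else 0,
    }


if __name__ == "__main__":
    for cpt, raw in enumerate((171313694, 2492241038)):
        print("pcl_locks[%d]:" % cpt, decode_ticket_lock(raw))
```

Under these assumptions, pcl_locks[1] has head two counts ahead of tail, which corresponds to a single extra unlock, while pcl_locks[0] has tail 24 counts ahead of head, which corresponds to a holder plus queued waiters.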
We currently suspect an issue on the routing path, although code inspection did not reveal anything obvious. We are continuing to investigate.
In the meantime, Dmitry installed IFS 10.4 on the routers (soak-14 and soak-15) to avoid the HFI bug that leads to a deadlock.
The next time we run the soak tests through the routers, can we turn on net error logging:
lctl set_param debug=+neterror
Then turn on the debug daemon to capture any relevant logs during the test run. |
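The capture requested above could be scripted roughly as follows. This is only a sketch: the log path and the 1024 MB size limit are placeholders rather than values from this ticket, and the helper names are hypothetical. It relies on the `lctl debug_daemon start <file> [megabytes]`, `lctl debug_daemon stop` and `lctl debug_file <input> [output]` subcommands, and defaults to a dry run that only prints the commands.

```python
import subprocess


def start_capture_cmds(log_path, size_mb=1024):
    """Commands to run on each router before the soak test starts."""
    return [
        ["lctl", "set_param", "debug=+neterror"],
        ["lctl", "debug_daemon", "start", log_path, str(size_mb)],
    ]


def stop_capture_cmds(log_path):
    """Commands to run after the test: stop the daemon, convert the log to text."""
    return [
        ["lctl", "debug_daemon", "stop"],
        ["lctl", "debug_file", log_path, log_path + ".txt"],
    ]


def run_all(cmds, dry_run=True):
    """Print each command, or execute it when dry_run is False (needs root)."""
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    log = "/tmp/lnet-neterror.dbg"  # placeholder path, not from the ticket
    run_all(start_capture_cmds(log))
    # ... soak test runs here ...
    run_all(stop_capture_cmds(log))
```

Keeping the start and stop steps separate lets the same helpers be called from whatever harness drives the soak run on soak-14 and soak-15.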
| Comment by Dmitry Eremin (Inactive) [ 13/Jul/17 ] |
|
Maybe |
| Comment by Amir Shehata (Inactive) [ 13/Jul/17 ] |
|
After looking at the core, it appears that all the CPUs are stuck on CPT 0. |