  Lustre / LU-15742

lockup in LNetMDUnlink during filesystem migration

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.8
    • Environment: TOSS 3.7-19 based
      RHEL kernel 3.10.0-1160.59.1
      lustre 2.12.8_6.llnl
    • Severity: 3

    Description

      We are having significant lnet issues that have caused us to disable lustre on one of our compute clusters (catalyst). We've had to turn off all of the router nodes in that cluster.

      When the routers for catalyst are on, we see lots of errors and have connectivity problems on multiple clusters.

      This ticket may be useful to explain our lnet setup. https://jira.whamcloud.com/browse/LU-15234

      UPDATE: The initial issues have been resolved; our clusters and file systems are working, and we no longer have to turn off clusters and/or routers. This ticket is now focused on the stack trace containing LNetMDUnlink() as a possible root cause. The OS update and underlying network issues we had seem to have been confounders.

      Related to https://jira.whamcloud.com/browse/LU-11895

      Attachments

        1. call-2022-4-19.tar.gz
          943 kB
        2. console.catalyst153
          1.96 MB
        3. console.orelic.tar.gz
          368 kB
        4. console.orelic2
          430 kB
        5. console.pascal128
          1.73 MB
        6. console.zrelic.tar.gz
          304 kB
        7. console.zrelic2
          485 kB
        8. lustre_network_updated.jpg
          202 kB
        9. opensm.orelic.log.gz
          132 kB
        10. opensm.zrelic.log.gz
          194 kB
        11. pascal128-vmcore-dmesg.txt
          772 kB
        12. pfstest-nodes.tar.gz
          7.51 MB

        Activity

          [LU-15742] lockup in LNetMDUnlink during filesystem migration
          ofaaland Olaf Faaland added a comment -

          Limiting lru_size worked around the issue successfully. Gian removed topllnl.

          Gian is working on a reproducer and a debug patch that checks for unintended loops in the hash chains being traversed by lnet_res_lh_lookup(), and determines the length of those chains if no loop is found.

          Once we have more information we'll update this ticket.
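          For illustration, a debug check along those lines might look like the sketch below. This is only a minimal sketch: the helper name lnet_res_lh_chain_check(), the MAX_LH_CHAIN_LEN bound, and the warning text are assumptions rather than code from an actual patch, and it leans on the LNet/libcfs headers for struct lnet_libhandle, list_for_each_entry() and CWARN().

           /* Hypothetical debug helper: walk one lh_hash_chain bucket under the
            * resource lock (as lnet_res_lh_lookup() does), bail out if the chain
            * exceeds a sanity bound (suggesting a loop or runaway growth), and
            * otherwise return the observed chain length. */
           #define MAX_LH_CHAIN_LEN 100000

           static int
           lnet_res_lh_chain_check(struct list_head *head)
           {
                   struct lnet_libhandle *lh;
                   int len = 0;

                   list_for_each_entry(lh, head, lh_hash_chain) {
                           if (++len > MAX_LH_CHAIN_LEN) {
                                   CWARN("lh hash chain exceeds %d entries, possible loop\n",
                                         MAX_LH_CHAIN_LEN);
                                   return -ELOOP;
                           }
                   }
                   return len;
           }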


          ssmirnov Serguei Smirnov added a comment -

          After reviewing the related tickets and discussing with Amir, there are some things we can try.

          The previous tickets showed that the MD lists were growing so much that it eventually caused issues when looking through the hash in LNetMDUnlink().

          1) If the number of CPTs can be increased on the client, doing so may help avoid collisions (see the sketch after this list).

          2) As per Andreas' comment in LU-11092: "Another option would be to set a limit on the number of locks on the client and increase the max age to something better like 10 minutes, like lctl set_param ldlm.namespaces.*.lru_size=100000 ldlm.namespaces.*.lru_max_age=600000 or similar." I'm not sure we have the same issue here as in LU-11092, as this appears to be beyond LNet. Perhaps adilger can comment on whether it is relevant.

          3) In another instance, a similar problem was worked around by a combination of
          ldlm.namespaces.*.lru_max_age=300s
          llite.*.inode_cache=0
          which may not be good for client performance, but prevented the client from crashing.
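
          As a rough way to picture why suggestion 1) helps: each MD handle cookie selects a per-CPT resource container and then a hash bucket inside it, and LNetMDUnlink() walks that one bucket's chain linearly, which is where the soft-lockup traces show the CPU spinning. The toy user-space model below only illustrates that shape; the type names, fields, and the cookie-to-bucket arithmetic are made up and do not match the real LNet layout.

           #include <stdint.h>
           #include <stddef.h>

           /* Toy model, not LNet code: a cookie picks one of `ncpts` containers,
            * then one of that container's buckets, and the bucket's chain is
            * scanned linearly.  More CPTs (or buckets) spread handles across more
            * chains, and fewer outstanding handles shorten each chain; either way
            * the linear walk gets cheaper. */
           struct handle {
                   uint64_t       cookie;
                   struct handle *next;        /* stand-in for lh_hash_chain */
           };

           struct container {
                   struct handle **buckets;    /* chain heads */
                   unsigned int    nbuckets;
           };

           static struct handle *
           model_lookup(struct container *cpts, unsigned int ncpts, uint64_t cookie)
           {
                   struct container *rec = &cpts[cookie % ncpts];
                   struct handle *lh = rec->buckets[(cookie / ncpts) % rec->nbuckets];

                   for (; lh != NULL; lh = lh->next)   /* cost is O(chain length) */
                           if (lh->cookie == cookie)
                                   return lh;
                   return NULL;
           }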

          Thanks,

          Serguei.


          behlendorf Brian Behlendorf added a comment -

          The stack traces from the logs look very similar to those from https://jira.whamcloud.com/browse/LU-11100 and several duplicate linked bugs.

          There was a comment in LU-11100 asking for additional information about who was holding the lock.  According to the NMI watchdog the holder appears to be executing the list_for_each_entry loop in LNetMDUnlink() -> lnet_handle2md() -> lnet_res_lh_lookup().  We observed this in the stack traces from multiple clients which hit the issue.

          struct lnet_libhandle *
          lnet_res_lh_lookup(struct lnet_res_container *rec, __u64 cookie)
          {
                  ...
          
                  list_for_each_entry(lh, head, lh_hash_chain) {
                          if (lh->lh_cookie == cookie)
                                  return lh;
                   }
                   return NULL;
           }
          

           

          [1650641425.080626] NMI watchdog: BUG: soft lockup - CPU#7 stuck for 23s! [ptlrpcd_00_34:17033]
          [1650641425.212628] CPU: 7 PID: 17033 Comm: ptlrpcd_00_34 Kdump: loaded Tainted: P           OEL ------------ T 3.10.0-1160.53.1.1chaos.ch6.x86_64 #1
          [1650641425.225628] Hardware name: Penguin Computing Relion X1904GT/MG20-OP0-ZB, BIOS R04 07/31/2017 
          [1650641425.233628] task: ffff8f643be51080 ti: ffff8f641a7e4000 task.ti: ffff8f641a7e4000
          [1650641425.241628] RIP: 0010:[<ffffffffc0adf288>]  [<ffffffffc0adf288>] lnet_res_lh_lookup+0x48/0x70 [lnet]
          [1650641425.251629] RSP: 0018:ffff8f641a7e7ba0  EFLAGS: 00000206
          [1650641425.257629] RAX: 0000000000000000 RBX: ffffffffffffff10 RCX: ffff9dc719aa5cc0
          [1650641425.264629] RDX: ffff8f6327022340 RSI: 000000001ba3d665 RDI: ffff8f64359bff40
          [1650641425.272629] RBP: ffff8f641a7e7ba0 R08: ffff8f443f5db8c0 R09: ffff8f443f85b8c0
          [1650641425.279629] R10: 0000000000000000 R11: 000000000000000f R12: 0000000000090001
          [1650641425.287629] R13: ffff8f443f45b8c0 R14: 0000000000390000 R15: 0000000000000000
          [1650641425.294629] FS:  0000000000000000(0000) GS:ffff8f443f5c0000(0000) knlGS:0000000000000000
          [1650641425.303629] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          [1650641425.309629] CR2: 00002aaab320f024 CR3: 0000003ff84b8000 CR4: 00000000003607e0
          [1650641425.317629] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          [1650641425.324629] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
          [1650641425.332630] Call Trace:
          [1650641425.335630]  [<ffffffffc0af242c>] LNetMDUnlink+0xac/0x180 [lnet]
          [1650641425.341630]  [<ffffffffc3ce8dd6>] ptlrpc_unregister_reply+0x156/0x880 [ptlrpc]
          [1650641425.349630]  [<ffffffffc3cedc5e>] ptlrpc_expire_one_request+0xfe/0x550 [ptlrpc]
          [1650641425.356630]  [<ffffffffc3cee15f>] ptlrpc_expired_set+0xaf/0x1a0 [ptlrpc]
          [1650641425.364630]  [<ffffffffc3d1f1bc>] ptlrpcd+0x29c/0x570 [ptlrpc]
          [1650641425.370630]  [<ffffffff994e1550>] ? wake_up_state+0x20/0x20
          [1650641425.376630]  [<ffffffffc3d1ef20>] ? ptlrpcd_check+0x5e0/0x5e0 [ptlrpc]
          [1650641425.383630]  [<ffffffff994cb221>] kthread+0xd1/0xe0
          [1650641425.388630]  [<ffffffff994cb150>] ? insert_kthread_work+0x40/0x40
          [1650641425.395630]  [<ffffffff99bc4fdd>] ret_from_fork_nospec_begin+0x7/0x21
          [1650641425.402630]  [<ffffffff994cb150>] ? insert_kthread_work+0x40/0x40
          

           


          defazio Gian-Carlo Defazio added a comment -

          I've uploaded the console logs for a node that crashed during the call on 2022-04-22, pascal128:

          console.pascal128, pascal128-vmcore-dmesg.txt


          defazio Gian-Carlo Defazio added a comment -

          Are you available for another call today?

          defazio Gian-Carlo Defazio added a comment - edited

          The migration jobs are now being run on a different cluster, pascal.

          pascal is in the same building, B654, as the source and destination of the migrations (zinc to boa).

          We're still having issues with dropped messages and hangs and timeouts while the migrations are running. However, some migration jobs have been able to complete.

          The machines in B451 and the relic routers look good now that they're not involved in the migration.


          defazio Gian-Carlo Defazio added a comment -

          Cameron started running tests before I could.

          He ran an ior test with 1m block sizes and had no issues.

          He ran this ior command with no issues:

          ior -a POSIX -i 2 -b 32G -t 32k -F -C -e -z -o /p/asplustre2/harr1/OST0000/2204201032 -v -v -wWr -g -d 15

          We had a configuration change made to the 2 infiniband networks, B654_CZ_IB_SAN and B451_CZ_IB_SAN, which turned on hardware proxy mode. This was because hardware proxy mode had helped with another issue at our site on a different network.

          After hardware proxy mode was turned on, Cameron ran his migration jobs again. They caused issues, including the same errors we've seen previously on the relic routers, as well as timeouts and disconnects for other clusters on the network.

          However, there are currently 3 migration jobs running, things look ok, and a few have been able to complete today.

          defazio Gian-Carlo Defazio added a comment - edited

          That might be the case, but it's hard to tell due to how Cameron's migration script works.

          Me:

          also, does your migration script tell you what phase of dsync you're in? That is, is it possible to see what phase you're in when things start going bad? Serguei seems to think that things were fine during the walk of zinc (FS "A"), which is kinda how it looked to me too yesterday.

          Cameron:

          That's not quite how the jobs work. The script first looks for the heaviest users and prioritizes them at the top of the user list, then goes through the user list and submits a new job for each user or group directory at the root of the file system. So there's always a mix of jobs doing pure syncing and some doing walking.

          You are correct that the tool we're using, dsync from mpiFileUtils, scans (or walks) the source and destination before making changes to the destination.

           

          We have done IO testing on boa (FS "B"), although a lot of it was testing hardware and zfs performance, not lustre. However, we've done lustre testing on it as well.

          I ran some IOR benchmarks on boa about 2 months ago after an OS update that changed the lustre version. However, my tests used relatively large transfer sizes (2M) and block sizes (2G). I don't have any records of tests using small transfer and block sizes.

          We generally use catalyst for such tests and transfers because catalyst has a lot of spare cycles, and we did use it for some (and I believe most) of the initial performance testing on boa.

          I'll try some IOR runs with smaller sizes from catalyst.


          ssmirnov Serguei Smirnov added a comment -

          These messages in orelic logs:

           Apr 19 15:22:28 orelic4 kernel: LNetError: 15918:0:(socklnd.c:1681:ksocknal_destroy_conn()) Completing partial receive from 12345-172.16.70.64@tcp[1], ip 172.16.70.64:988, with error, wanted: 224, left: 224, last alive is 0 secs ago

          are likely caused by the connection getting destroyed while transferring data, but I haven't yet been able to establish why the connection is getting destroyed.

          My understanding is that

          • there are 2 FS involved and the migration job is first scanning FS "A" and then, after a while, applying changes to FS "B"
          • the issue starts happening when FS "B" starts getting accessed

          Is this correct? If so, have you tried any other (fio) tests with FS "B" (using the same catalyst nodes)?


          defazio Gian-Carlo Defazio added a comment -

          I've added pfstest-nodes.tar.gz, which has the console logs for slurm group pfstest on catalyst. These are the nodes used to perform the migration.


          People

            Assignee: ssmirnov Serguei Smirnov
            Reporter: defazio Gian-Carlo Defazio
            Votes: 0
            Watchers: 7
