Details
-
Bug
-
Resolution: Cannot Reproduce
-
Blocker
-
None
-
Lustre 2.1.1
-
None
-
3
-
3981
Description
We are experiencing a severe disruption of service on our classified network. Beginning around 5pm yesterday (Friday Jun 29) we began to have many evictions and reconnections across multiple servers and across multiple lustre clusters on our classified network. Some sort of network event may have triggered this, but we're not sure at this point. Lustre servers started getting STONITH'd by their failover partners.
Many servers have been getting panics due to CPU hard lockup. The stack traces aren't helpful; they just show the swapper task handling the watchdog NMI. We're trying to quiesce things by stopping lnet on the compute cluster routers, getting lustre started on all the servers, then bringing client clusters back online one at a time. However, we are still having sporadic CPU hard lockups on OSS's as clients reconnect. Typically the lockup follows several thousand client reconnects.
We have some complete crash dumps to look at, but there's not much to go on. Some process must be leaving interrupts disabled, but I'm not sure how to go about identifying with the data we have.