Details
Bug
Resolution: Fixed
Critical
Upstream
None
3
Description
The upstream version of lnet-selftest crashes the node when the module is loaded. We know that a kernel developer changed the definition of kiov in both LNet and lnet-selftest, and the crash may be related to that change. In one run, I saw this log message before the crash:
LNet: 16216:0:(framework.c:1712:sfw_startup()) Failed to reserve enough buffers: service debug, 256 needed: -30720
This may be a hint that we are running out of memory, which would in turn destabilize the kernel. It is possible that the new kiov implementation is exhausting memory, especially when buffers are allocated per-CPT.
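To compare this message across runs, a small script can pull the service name, the buffer count, and the return code out of the log line. This is only a sketch: the pattern is inferred from the single line quoted above, and the helper name `parse_reserve_failure` is made up for illustration. It also checks whether the printed code equals -ENOMEM. For the line above it does not, so it may be worth looking at what the buffer-reservation path actually returned before concluding that memory is exhausted.

```python
import errno
import re

# Pattern inferred from the one log line quoted in this ticket;
# other LNet console messages may be laid out differently.
LOG_RE = re.compile(
    r"LNet: (?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) "
    r"Failed to reserve enough buffers: "
    r"service (?P<service>.+?), (?P<needed>\d+) needed: (?P<rc>-?\d+)"
)


def parse_reserve_failure(line):
    """Return the fields of a 'Failed to reserve enough buffers' line, or None."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    return {
        "pid": int(m["pid"]),
        "func": m["func"],
        "service": m["service"],
        "needed": int(m["needed"]),
        "rc": int(m["rc"]),
    }


line = ("LNet: 16216:0:(framework.c:1712:sfw_startup()) Failed to reserve "
        "enough buffers: service debug, 256 needed: -30720")
info = parse_reserve_failure(line)
print(info)
# A plain out-of-memory return would be -ENOMEM.
print("rc is -ENOMEM:", info["rc"] == -errno.ENOMEM)
```

Running the same helper over the console log from several crashes would show whether the failing service and return code are always the same.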