Details
Bug
Resolution: Fixed
Critical
Upstream
None
3
Description
The upstream version of lnet-selftest crashes the node when the module is loaded. We know that a kernel developer changed the definition of kiov in both LNet and lnet-selftest, and the crash may be related to that change. In one run, I saw this message in the log just before the crash:
LNet: 16216:0:(framework.c:1712:sfw_startup()) Failed to reserve enough buffers: service debug, 256 needed: -30720
This may be a hint that the node is running out of memory, which in turn destabilizes the kernel. The new kiov code may be exhausting memory, especially when buffers are allocated per-CPT.
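When comparing several crash runs, it can help to pull the fields out of messages like the one quoted above. The following is a minimal sketch only: the helper name `parse_lnet_line`, the field names, and the regular expression are assumptions made for illustration, based solely on the shape of that single log line, and are not part of LNet or lnet-selftest. The trailing number is treated as a return code because that is what it appears to be; the sketch does not interpret its value.

```python
import re

# Pattern inferred from the one log line quoted in this ticket (an assumption,
# not an official format description):
#   <subsystem>: <pid>:<second field>:(<file>:<line>:<function>()) <message>
LOG_RE = re.compile(
    r"^(?P<subsystem>\w+): "
    r"(?P<pid>\d+):(?P<field2>\d+):"
    r"\((?P<file>[^:]+):(?P<line>\d+):(?P<func>\w+)\(\)\) "
    r"(?P<message>.*)$"
)

# A trailing ": <integer>" in the message is assumed to be a return code.
TRAILING_CODE_RE = re.compile(r":\s*(-?\d+)\s*$")


def parse_lnet_line(text):
    """Split one console log line into its fields, or return None if it does not match."""
    match = LOG_RE.match(text.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["pid"] = int(fields["pid"])
    fields["line"] = int(fields["line"])
    code = TRAILING_CODE_RE.search(fields["message"])
    fields["code"] = int(code.group(1)) if code else None
    return fields


if __name__ == "__main__":
    sample = (
        "LNet: 16216:0:(framework.c:1712:sfw_startup()) "
        "Failed to reserve enough buffers: service debug, 256 needed: -30720"
    )
    print(parse_lnet_line(sample))
```

Grouping the parsed records by `func` and `code` across runs would show whether the buffer reservation failure always precedes the crash or only appeared in this one run.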
Oleg figured this out and submitted a patch upstream. I cannot find the "send-email" message right now, but I will post it here when I do.
Once the patch lands, this ticket will be marked as resolved.
The problem did not affect master, so no patch is needed there.