[LU-9072] Upstream lnet-selftest causing node to crash on load Created: 01/Feb/17  Updated: 13/Apr/17  Resolved: 13/Apr/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Upstream
Fix Version/s: Upstream

Type: Bug Priority: Critical
Reporter: Doug Oucharek (Inactive) Assignee: Doug Oucharek (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The upstream version of lnet-selftest crashes the node when the module is loaded. We know that a kernel developer changed the definition of kiov in both LNet and lnet-selftest, and the crash may be related to that change. In one run, I saw this log message just before the crash:

LNet: 16216:0:(framework.c:1712:sfw_startup()) Failed to reserve enough buffers: service debug, 256 needed: -30720

This may be a hint that we are running out of memory, which would make the kernel unstable. The new kiov code may be causing memory exhaustion, especially when buffers are allocated per-CPT.



 Comments   
Comment by Doug Oucharek (Inactive) [ 23/Feb/17 ]

Oleg figured this out and submitted a patch upstream.  I cannot find the "send-email" email right now, but will post it here when I do.  

Once the patch lands, this ticket will be marked as resolved.

The problem did not affect master, so no patch is needed there.
