Details
- Type: Bug
- Resolution: Unresolved
- Priority: Critical
- Fix Version/s: None
- Affects Version/s: Lustre 2.16.0, Lustre 2.12.9
- Labels: None
- Severity: 3
Description
Attempting to collect detailed data for an investigation, I noticed frequent reports from the debug daemon that the buffer was overflowing, but then the actual statistic caught my eye:
[1243871.266883] debug daemon buffer overflowed; discarding 10% of pages (1 of 1)
[1243871.270173] debug daemon buffer overflowed; discarding 10% of pages (1 of 0)
So this is in effect telling us that tcd->tcd_cur_pages is 0 (or 1), while I know tcd->tcd_max_pages cannot be any less than 1500.
Which in turn means we are hitting an allocation failure:
if (tcd->tcd_cur_pages < tcd->tcd_max_pages) {
        if (tcd->tcd_cur_stock_pages > 0) {
                tage = cfs_tage_from_list(tcd->tcd_stock_pages.prev);
                --tcd->tcd_cur_stock_pages;
                list_del_init(&tage->linkage);
        } else {
                tage = cfs_tage_alloc(GFP_ATOMIC);
                if (unlikely(tage == NULL)) {
                        if ((!memory_pressure_get() ||
                             in_interrupt()) && printk_ratelimit())
                                printk(KERN_WARNING
                                       "cannot allocate a tage (%ld)\n",
                                       tcd->tcd_cur_pages);
                        return NULL;
                }
        }
but there's no printk, which is a bit puzzling. Perhaps just due to memory_pressure_get() returning 1? And that would be due to increased memory pressure from caching on OSSes?
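For what it's worth, the simplest way to make such failures visible would be to stop suppressing the warning under memory pressure and rely on rate-limiting alone. A minimal sketch of that tweak, not a tested patch:

        tage = cfs_tage_alloc(GFP_ATOMIC);
        if (unlikely(tage == NULL)) {
                /* hypothetical tweak: never suppress the warning entirely,
                 * only rate-limit it, so atomic allocation failures under
                 * memory pressure still show up in the console log */
                if (printk_ratelimit())
                        printk(KERN_WARNING
                               "cannot allocate a tage (%ld)\n",
                               tcd->tcd_cur_pages);
                return NULL;
        }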
There have also been anecdotal reports and observations that, when the debug daemon is not in use, lctl dk tends to return a lot of old data and very few new messages; could those be explained by the same effect?
This also goes hand in hand with LU-15916, where the supposed reserve pages for debug buffer use are never filled.
Sounds like we need to reinstate some sort of preallocated pages list(s).
Basically the way I see it is:
- We retain the current "allocate first" logic, but if the allocation fails we fall back to a page from an emergency buffer that is always kept preallocated at some level (see the sketch after this list).
- We reinstate the stock-pages mechanism; TCD_STOCK_PAGES will likely need to be lowered, as it is currently hardcoded to the equivalent of 5 megabytes.
Both approaches would need to ensure the buffer refill is called from somewhere once the pages are actually consumed.
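Just to make the emergency-buffer idea concrete, here is a minimal sketch. All of the names (EMERG_PAGES, emerg_get(), emerg_refill()) are made up, the pool size is arbitrary, and the refill is punted to a workqueue so the GFP_KERNEL allocation happens in process context:

        #include <linux/list.h>
        #include <linux/spinlock.h>
        #include <linux/workqueue.h>

        #define EMERG_PAGES 64  /* arbitrary; much smaller than TCD_STOCK_PAGES */

        static LIST_HEAD(emerg_list);
        static DEFINE_SPINLOCK(emerg_lock);
        static int emerg_count;
        /* INIT_WORK(&emerg_refill_work, emerg_refill) at setup time */
        static struct work_struct emerg_refill_work;

        /* taken on the cfs_tage_alloc(GFP_ATOMIC) failure path; irqsave
         * because trace messages can be emitted from interrupt context */
        static struct cfs_trace_page *emerg_get(void)
        {
                struct cfs_trace_page *tage = NULL;
                unsigned long flags;

                spin_lock_irqsave(&emerg_lock, flags);
                if (!list_empty(&emerg_list)) {
                        tage = cfs_tage_from_list(emerg_list.next);
                        list_del_init(&tage->linkage);
                        emerg_count--;
                }
                spin_unlock_irqrestore(&emerg_lock, flags);

                /* refill from process context once a page is consumed */
                if (tage != NULL)
                        schedule_work(&emerg_refill_work);
                return tage;
        }

        static void emerg_refill(struct work_struct *work)
        {
                for (;;) {
                        struct cfs_trace_page *tage;
                        unsigned long flags;

                        spin_lock_irqsave(&emerg_lock, flags);
                        if (emerg_count >= EMERG_PAGES) {
                                spin_unlock_irqrestore(&emerg_lock, flags);
                                break;
                        }
                        spin_unlock_irqrestore(&emerg_lock, flags);

                        tage = cfs_tage_alloc(GFP_KERNEL);
                        if (tage == NULL)
                                break;

                        spin_lock_irqsave(&emerg_lock, flags);
                        list_add_tail(&tage->linkage, &emerg_list);
                        emerg_count++;
                        spin_unlock_irqrestore(&emerg_lock, flags);
                }
        }

emerg_get() would slot into the cfs_tage_alloc(GFP_ATOMIC) failure path shown in the excerpt above.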
Additionally, I guess we could try to divine when a non-atomic allocation is possible and actually perform that one? That would have a potential performance impact, though, and is less desirable? All in all, having preallocated pages in some form seems the most efficient option.
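A sketch of that "divine the context" idea, with a hypothetical helper; note that in_atomic() cannot see spinlocks held on kernels without CONFIG_PREEMPT_COUNT, which is part of why this is less reliable and less desirable:

        #include <linux/gfp.h>
        #include <linux/preempt.h>
        #include <linux/irqflags.h>

        /* hypothetical helper: use a sleep-capable allocation only when
         * we are clearly in preemptible process context with interrupts
         * enabled, otherwise fall back to GFP_ATOMIC as today */
        static gfp_t tage_gfp_mask(void)
        {
                if (in_interrupt() || irqs_disabled() || in_atomic())
                        return GFP_ATOMIC;
                return GFP_KERNEL;
        }

The call site would then become tage = cfs_tage_alloc(tage_gfp_mask()).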
Also, while we are looking into this dusty corner, perhaps we could finally do something about the arbitrary 80/10/10 split of pages for debug buffers. Restoring close to full LRU behavior seems desirable, as long as we can actually achieve it without too much locking.
Something along the lines of "allocate pages with abandon until we hit the debug_mb value, then discard the oldest 10% once the limit is met". Of course, we need to figure out how to actually find the oldest 10% of pages efficiently. Having a single list with the corresponding locking is likely to be pretty expensive and negates the whole per-CPU scheme that is in place.
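For the accounting half, at least, a single global atomic counter could enforce the debug_mb cap without a globally locked list; a sketch with hypothetical names:

        #include <linux/atomic.h>
        #include <linux/types.h>

        /* hypothetical: one global page count shared by all TCDs,
         * trace_max_pages derived from debug_mb at setup time;
         * allocation itself stays per-CPU and lock-free */
        static atomic_long_t trace_total_pages;
        static long trace_max_pages;

        static void wake_up_trim_task(void); /* hypothetical, see next sketch */

        static bool trace_need_trim(void)
        {
                return atomic_long_read(&trace_total_pages) >= trace_max_pages;
        }

        /* called whenever a TCD adds a freshly allocated page */
        static void trace_page_added(void)
        {
                atomic_long_inc(&trace_total_pages);
                if (trace_need_trim())
                        wake_up_trim_task();
        }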
Alternatively, we could iterate all TCDs from a separate task as we get close to the limit and discard pages there, but that has its own complications, like comparing the oldest pages across different TCDs to see which ones are to be dropped, and so on.
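The cross-TCD comparison could look roughly like this, reusing trace_need_trim() from the previous sketch; it assumes each cfs_trace_page grew a hypothetical tp_when timestamp when it was filled, and it elides the per-TCD locking, which is exactly where the complications live:

        /* hypothetical trim pass run from a separate task: repeatedly
         * find the TCD whose head (oldest) page is globally the oldest
         * and drop it, until we are back under the limit */
        static void trace_trim_oldest(void)
        {
                while (trace_need_trim()) {
                        struct cfs_trace_cpu_data *tcd, *victim = NULL;
                        u64 oldest = U64_MAX;
                        int i, j;

                        cfs_tcd_for_each(tcd, i, j) {
                                struct cfs_trace_page *head;

                                if (list_empty(&tcd->tcd_pages))
                                        continue;
                                head = cfs_tage_from_list(tcd->tcd_pages.next);
                                if (head->tp_when < oldest) { /* tp_when: hypothetical */
                                        oldest = head->tp_when;
                                        victim = tcd;
                                }
                        }
                        if (victim == NULL)
                                break;
                        tcd_drop_head_page(victim); /* hypothetical helper */
                }
        }

Even then it is one full scan per dropped page, so the discard would want to be batched into the 10% chunks mentioned above.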
I wonder if neilb has any smart ideas here by any chance?
Of course, the radical alternative is to get rid of all of this and actually convert to tracepoints, but the problem is that we were not able to get the functionality we wanted from them in the past, despite several attempts by simmonsja.