Found during IR testing at ORNL.
On MDS startup, soon after clients start hitting it, all mdt_xx threads start consuming all available CPU.
We ran sysrq-t, and all of them are in grow_rqbd.
I checked the code, and as soon as a thread is in that state it runs an unbreakable loop that does 64 * numonlinecpus (= 16) = 1024 allocations of 16k each.
The condition to enter that loop is racy: num posted rqbds < nbuf_group/2.
So if 1000 threads enter it at the same time, we have 1000 threads each doing 1024 of those allocations.
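
To put rough numbers on that worst case, here is a small Python sketch of the arithmetic. It is a model, not Lustre code. The factor of 64, the 16 online CPUs, the 16k size and the 1000 threads come from the description above; the function names are made up for illustration, "16k" is assumed to mean 16 KiB, and the check is modelled in its worst case, where every thread samples the posted count before any other thread has posted a buffer.

```python
# Back-of-the-envelope model of the allocation burst; not Lustre code.
BUFS_PER_CPU = 64        # per-CPU factor from the report
ONLINE_CPUS = 16         # online CPUs on this MDS, per the report
BUF_SIZE = 16 * 1024     # assumption: "16k" means 16 KiB


def allocs_per_thread(bufs_per_cpu=BUFS_PER_CPU, cpus=ONLINE_CPUS):
    """Allocations one thread performs once it is inside the loop."""
    return bufs_per_cpu * cpus


def threads_passing_check(threads, posted, nbuf_group):
    """Worst case of the racy check: all threads read the same stale
    `posted` value, so either all of them pass or none of them do."""
    return threads if posted < nbuf_group / 2 else 0


def burst_bytes(threads, buf_size=BUF_SIZE):
    """Total bytes requested if `threads` threads run the loop at once."""
    return threads * allocs_per_thread() * buf_size


if __name__ == "__main__":
    print(allocs_per_thread())                 # 1024
    print(burst_bytes(1) // 2**20, "MiB")      # 16 MiB
    print(burst_bytes(1000) / 2**30, "GiB")    # 15.625 GiB
```

Under these assumptions a single thread asks for 16 MiB, and 1000 threads passing the check together ask for about 15.6 GiB in one burst.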
We have a kdump log, but it still needs to be transferred.