We also see many messages like this:
out.nbp2-oss18.1452913951.gz.denum:
00000800:00000200:15.0:1452913946.993806:0:21340:0:(o2iblnd.c:1898:kiblnd_pool_alloc_node()) Another thread is allocating new TX pool, waiting 1024 HZs for her to complete.trips = 83498830
This message was added by a patch developed under https://jira.hpdd.intel.com/browse/LU-7054
http://review.whamcloud.com/#/c/16470/2/lnet/klnds/o2iblnd/o2iblnd.c
but we still see a very large "trips" count in these messages. I had assumed that the wait of 1024 HZs would slow this loop down. Or does the call simply schedule other threads while one is waiting, without actually sleeping? That is unclear to me. In the traces I've looked at, I don't see any new pools being successfully created (nor the message indicating how long pool creation took to complete).
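A rough sanity check on those numbers. This is only a sketch under assumptions: HZ = 1000 is a guess (the trace does not state it), and the backoff shape (start at 1 jiffy, double until the interval reaches one second) is my reading of the "waiting 1024 HZs" text, not something verified against the patch.

```python
HZ = 1000            # ASSUMPTION: jiffies per second, not stated in the trace
TRIPS = 83_498_830   # "trips" value from the log line above


def total_wait_jiffies(trips: int, hz: int) -> tuple[int, int]:
    """Return (total jiffies waited, final interval) for the assumed backoff.

    ASSUMPTION: the wait starts at 1 jiffy and doubles after each trip
    while it is still below one second (hz jiffies).
    """
    interval, total = 1, 0
    # Ramp-up phase: 1, 2, 4, ... until the interval reaches at least hz.
    while trips > 0 and interval < hz:
        total += interval
        interval *= 2
        trips -= 1
    # Steady state: every remaining trip waits the capped interval.
    return total + trips * interval, interval


jiffies, cap = total_wait_jiffies(TRIPS, HZ)
days = jiffies / HZ / 86_400
print(f"capped interval: {cap} jiffies")
print(f"time needed if every trip really slept: {days:.1f} days")
```

Under these assumptions the interval caps at 1024 jiffies, which matches the message, and 83 million full-length waits would add up to well over two years. So the loop is almost certainly not sleeping for the whole interval on each trip.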
Forgive me, I'm going a little from memory here... I seem to recall that there was some contention between freeing (unregister) and pool allocation. Is it possible that something slow in the deallocation path prevents new pools from being created?
Also, I'm not familiar with this code (and I'm looking at it on my Apple Watch), and the "schedule_timeout(interval)" call appeared to map to an inline null function, so I couldn't work out yet what it actually does.
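For what it's worth, as far as I know the Linux kernel's schedule_timeout() only sleeps for the timeout if the caller has first set the task state (e.g. with set_current_state(TASK_INTERRUPTIBLE)); called while the task is still TASK_RUNNING it returns almost immediately. Whether that is what happens here is an assumption I have not verified against the patch or the wrapper in use. The toy model below is not kernel code. It just counts loop trips for the two behaviours, with a made-up cost of 1 microsecond per non-sleeping trip and an assumed HZ of 1000.

```python
def trips_while_waiting(alloc_us: int, sleeps: bool,
                        hz: int = 1000, yield_cost_us: int = 1) -> int:
    """Count wait-loop trips until a simulated pool allocation finishes.

    alloc_us      -- how long the other thread's allocation takes (microseconds)
    sleeps        -- True: each trip really waits `interval` jiffies
                     False: each trip only yields (HYPOTHETICAL fixed cost)
    hz            -- ASSUMED jiffies per second
    yield_cost_us -- HYPOTHETICAL cost of one non-sleeping trip
    """
    elapsed_us, interval, trips = 0, 1, 0
    while elapsed_us < alloc_us:
        trips += 1
        if sleeps:
            elapsed_us += interval * 1_000_000 // hz
        else:
            elapsed_us += yield_cost_us
        # Assumed backoff: double the interval until it reaches one second.
        if interval < hz:
            interval *= 2
    return trips


one_second = 1_000_000
print("trips if the wait sleeps:     ", trips_while_waiting(one_second, True))
print("trips if the wait only yields:", trips_while_waiting(one_second, False))
```

In this model a one-second allocation costs about ten trips when the wait really sleeps, and about a million when it only yields. Trip counts in the tens of millions look much more like the second case.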
It looks like http://review.whamcloud.com/#/c/18025/ is a backport of the LU-7569 patch (http://review.whamcloud.com/#/c/17892/) that I have been asking for. Thanks.