|
We have a hang on production systems running 2.12.3 where LNet never comes up on a server if some routers are bad (it hangs on modprobe lustre, with no LNet service).
Approximate backtrace:
But it's really stuck in the loop that waits for rtr->lpni_alive_count to be non-zero on all LNet routers (the loop over &the_lnet.ln_routers).
I'm not sure why we don't always get stuck (there are always a couple of routers down), but it did get stuck last time. I only have a few traces left (unfortunately I didn't get a full crash dump for this one), but it looks a lot like LU-13001, except that 2.12.3 doesn't have LU-11297, so the patch for that one doesn't make sense here.
On the other hand, LU-11298 changes that check to (rtr->lp_state & LNET_PEER_DISCOVERED) instead, which sounds like it could be a good idea. I honestly can't say without a dump at hand; I will need to try to reproduce this somewhere more practical.
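
To make the two conditions easier to compare, here is a rough userspace sketch of the wait loop as I understand it. This is not the kernel code: struct mock_router, the helper names, the LNET_PEER_DISCOVERED value and the max_polls bound are all made up for illustration; only the field and flag names come from the description above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in value; the real flag is defined in the LNet headers. */
#define LNET_PEER_DISCOVERED (1u << 0)

/* Simplified userspace model of a router entry, NOT the kernel struct. */
struct mock_router {
	int lpni_alive_count;   /* stays 0 until some aliveness result is recorded */
	unsigned int lp_state;  /* peer state flags */
};

/* Condition described for 2.12.3: every router has a non-zero alive count. */
static bool all_known_by_alive_count(const struct mock_router *rtrs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (rtrs[i].lpni_alive_count == 0)
			return false;
	return true;
}

/* Condition described for LU-11298: every router has the DISCOVERED flag set. */
static bool all_known_by_discovery(const struct mock_router *rtrs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (!(rtrs[i].lp_state & LNET_PEER_DISCOVERED))
			return false;
	return true;
}

/*
 * Poll until the chosen condition holds. The max_polls bound is an
 * assumption added for this sketch so that it terminates; a loop without
 * such a bound never returns if one router never satisfies the condition.
 */
static bool wait_known_routerstate(const struct mock_router *rtrs, size_t n,
				   bool (*known)(const struct mock_router *, size_t),
				   int max_polls)
{
	for (int poll = 0; poll < max_polls; poll++) {
		if (known(rtrs, n))
			return true;
		/* the real code would sleep here before re-checking */
	}
	return false;
}
```

With one router whose lpni_alive_count is still 0 but whose lp_state has LNET_PEER_DISCOVERED set, the first condition never becomes true while the second one does, which is the kind of difference I would want to confirm with a real dump.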
|