Details
Type: Bug
Resolution: Unresolved
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.12.3
Labels: None
Severity: 3
Description
We are seeing a hang on production systems running 2.12.3: LNet never finishes setting up on a server if some routers are bad (modprobe lustre hangs, and there is no lnet service).
Approximate backtrace:
#0 __schedule
#1 schedule
#2 schedule_timeout
#3 lnet_router_post_mt_start
#4 lnet_monitor_thr_start
#5 LNetNIInit
#6 ptlrpc_ni_init
#7 ptlrpc_init_portals
#8 init_module
#9 do_one_initcall
#10 load_module
#11 sys_finit_module
#12 system_call_fastpath
It is actually stuck in the loop that waits for rtr->lpni_alive_count to be non-zero on all LNet routers (the loop over &the_lnet.ln_routers).
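To make the suspected behaviour concrete, here is a rough user-space sketch of such a wait loop. It is not the actual lnet code: struct model_router, the function names, and the max_polls bound are made up for illustration, and only the lpni_alive_count field name comes from the description above. The hang suggests the real loop has no such bound, so a single router whose count stays at zero would keep it waiting.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a router entry; only the field named in the
 * report is modelled. */
struct model_router {
	int lpni_alive_count;	/* 0: no state known for this router yet */
};

/* True once every router in the array has a non-zero lpni_alive_count.
 * An empty array counts as "all known". */
static bool all_router_states_known(const struct model_router *rtrs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (rtrs[i].lpni_alive_count == 0)
			return false;
	return true;
}

/* Poll up to max_polls times. Returns the number of polls it took for all
 * router states to become known, or -1 if the bound was reached first.
 * The bound exists only so this sketch always terminates. */
static int wait_known_router_state(const struct model_router *rtrs, size_t n,
				   int max_polls)
{
	for (int poll = 1; poll <= max_polls; poll++) {
		if (all_router_states_known(rtrs, n))
			return poll;
		/* kernel code would sleep here, e.g. via schedule_timeout() */
	}
	return -1;
}
```

With two routers where one still has lpni_alive_count == 0, wait_known_router_state() in this model never succeeds and only returns because of the artificial bound.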
I'm not sure why we don't always get stuck (there are always a couple of routers down), but it did get stuck last time. From the few traces I have left (I didn't get a full crash dump on this one, unfortunately), it looks a lot like LU-13001... except that 2.12.3 doesn't have LU-11297, so the patch for that one doesn't make sense.
OTOH, LU-11298 changes that check to (rtr->lp_state & LNET_PEER_DISCOVERED) instead, which sounds like it could be a good idea? I honestly can't say without a dump at hand, unfortunately. I will need to try to reproduce this somewhere more practical....
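For comparison, a flag-based check along the lines of the LU-11298 expression could look roughly like the sketch below. This is a toy model, not the patch itself: struct model_peer and MODEL_PEER_DISCOVERED are hypothetical, and the bit value is a placeholder rather than the real LNET_PEER_DISCOVERED constant.

```c
#include <stdbool.h>
#include <stddef.h>

/* Placeholder bit for illustration; NOT the real LNET_PEER_DISCOVERED value. */
#define MODEL_PEER_DISCOVERED (1u << 0)

/* Hypothetical stand-in for a router peer; only lp_state is modelled. */
struct model_peer {
	unsigned int lp_state;	/* bitmask of peer state flags */
};

/* True once every router has the "discovered" flag set in lp_state.
 * An empty array counts as "all discovered". */
static bool all_routers_discovered(const struct model_peer *rtrs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (!(rtrs[i].lp_state & MODEL_PEER_DISCOVERED))
			return false;
	return true;
}
```

Whether this avoids the hang depends on whether a bad router ever ends up with the flag set, which this sketch cannot answer.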