[LU-13200] hang in lnet_wait_known_routerstate Created: 04/Feb/20  Updated: 05/Feb/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.3
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Dominique Martinet (Inactive) Assignee: Peter Jones
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We have a hang on production systems running 2.12.3 where LNet never finishes setting up on a server if some routers are bad (modprobe lustre hangs, no lnet service).

Approximate backtrace:

#0 __schedule
#1 schedule
#2 schedule_timeout
#3 lnet_router_post_mt_start
#4 lnet_monitor_thr_start
#5 LNetNIInit
#6 ptlrpc_ni_init
#7 ptlrpc_init_portals
#8 init_module
#9 do_one_initcall
#10 load_module
#11 sys_finit_module
#12 system_call_fastpath

But it's really stuck in the loop that waits for rtr->lpni_alive_count to be non-zero on all LNet routers (the loop over &the_lnet.ln_routers).
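To make the failure mode concrete, here is a minimal user-space sketch of that kind of wait loop. It is not the Lustre source: struct router, all_routers_known(), wait_known_routerstate() and the max_polls bound are hypothetical names introduced for illustration, and only the "alive count must be non-zero on every router" condition comes from the description above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a router peer entry; not the real Lustre struct. */
struct router {
    int alive_count;   /* models rtr->lpni_alive_count */
};

/* True once every known router has a non-zero alive count. */
bool all_routers_known(const struct router *rtrs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (rtrs[i].alive_count == 0)
            return false;
    }
    return true;
}

/*
 * Poll until all routers are known or max_polls is exhausted.
 * Returns the number of polls that failed before success, or -1 on giving up.
 * Without an upper bound, a router whose count stays at zero keeps
 * the loop waiting forever, which matches the hang described above.
 */
int wait_known_routerstate(const struct router *rtrs, size_t n, int max_polls)
{
    for (int polls = 0; polls < max_polls; polls++) {
        if (all_routers_known(rtrs, n))
            return polls;
        /* kernel code would sleep here (schedule_timeout in the backtrace) */
    }
    return -1;
}
```

In this sketch the router state never changes between polls, so a single router with alive_count == 0 is enough to exhaust the bound; the bound exists only so the example terminates.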

I'm not sure why we don't always get stuck (there are always a couple of routers down), yet it did get stuck last time. I only have a few traces left (unfortunately I didn't get a full crash dump for this one), but it looks a lot like LU-13001... except that 2.12.3 doesn't have LU-11297, so the patch for that one doesn't make sense.

OTOH, LU-11298 changes that check to (rtr->lp_state & LNET_PEER_DISCOVERED) instead, which sounds like it could be a good idea? I honestly can't say without a dump at hand; unfortunately I will need to try to reproduce this somewhere more practical...
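The difference between the two conditions can be sketched side by side. This is an illustration only: struct router_sketch, both predicate names and the SKETCH_PEER_DISCOVERED bit value are hypothetical, and whether a router that is down ever reaches the discovered state in Lustre is exactly the open question here, so the sketch only shows that the two tests can disagree for a given state.

```c
#include <stdbool.h>

/* Hypothetical bit value; the real LNET_PEER_DISCOVERED constant may differ. */
#define SKETCH_PEER_DISCOVERED (1u << 0)

/* Hypothetical stand-in for a router peer; not the real Lustre struct. */
struct router_sketch {
    int alive_count;      /* models rtr->lpni_alive_count */
    unsigned int state;   /* models rtr->lp_state */
};

/* Condition described for 2.12.3: alive count must be non-zero. */
bool known_by_alive_count(const struct router_sketch *r)
{
    return r->alive_count != 0;
}

/* Condition attributed to LU-11298: discovered flag must be set. */
bool known_by_discovery(const struct router_sketch *r)
{
    return (r->state & SKETCH_PEER_DISCOVERED) != 0;
}
```

A router with alive_count == 0 but the discovered bit set would block a wait loop built on the first predicate and pass one built on the second.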



 Comments   
Comment by Peter Jones [ 05/Feb/20 ]

Thanks Dominique

Generated at Sat Feb 10 02:59:15 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.