[LU-2133] Peers always report down for non-routers after LND disconnects Created: 09/Oct/12 Updated: 13/Apr/13 Resolved: 13/Apr/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0, Lustre 2.1.2 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Jeremy Filizetti | Assignee: | Isaac Huang (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | ptr | ||
| Environment: | Originated from |
| Severity: | 3 |
| Rank (Obsolete): | 5135 |
| Comments |
| Comment by Jeremy Filizetti [ 09/Oct/12 ] |
|
After adding the patch from , peers always report down for non-routers after an LND disconnect. The following change restores the "up" notification when a connection is (re)established:

diff --git a/lnet/klnds/o2iblnd/o2iblnd_cb.c b/lnet/klnds/o2iblnd/o2iblnd_cb.c
index eb567a5..d620cf3 100644
--- a/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -2060,6 +2060,7 @@ kiblnd_connreq_done(kib_conn_t *conn, int status)
         }

         write_unlock_irqrestore(&kiblnd_data.kib_global_lock, flags);
+        lnet_notify(peer->ibp_ni, peer->ibp_nid, 1, cfs_time_current());

         /* Schedule blocked txs */
         spin_lock (&conn->ibc_lock);
diff --git a/lnet/klnds/socklnd/socklnd_cb.c b/lnet/klnds/socklnd/socklnd_cb.c
index 56e5ad3..e84a37b 100644
--- a/lnet/klnds/socklnd/socklnd_cb.c
+++ b/lnet/klnds/socklnd/socklnd_cb.c
@@ -1987,9 +1987,12 @@ ksocknal_connect (ksock_route_t *route)
         }

         ksocknal_launch_connection_locked(route);
+        cfs_write_unlock_bh (&ksocknal_data.ksnd_global_lock);
+    } else {
+        cfs_write_unlock_bh (&ksocknal_data.ksnd_global_lock);
+        lnet_notify(peer->ksnp_ni, peer->ksnp_id.nid, 1, cfs_time_current());
     }
-    cfs_write_unlock_bh (&ksocknal_data.ksnd_global_lock);

     return retry_later;

 failed:

Unfortunately I only have ksocklnd and ko2iblnd clients, so I don't know about other LNDs and can't test them anyway. |
| Comment by Peter Jones [ 09/Oct/12 ] |
|
Isaac, could you please comment on this one? Thanks, Peter |
| Comment by Isaac Huang (Inactive) [ 19/Mar/13 ] |
|
Hi, did you mean that peers were reported as "down" in /proc/sys/lnet/peers?

The reason lnet_notify() isn't called from LNDs for "up" is that it was a design choice for LNet to poll LNDs when necessary instead of having LNDs interrupt LNet - e.g. when peer health isn't enabled at the LNet layer there aren't any polling calls into LNDs at all, and there are other reasons why polling from the upper layer is more efficient than interrupting from the lower layer.

If your concern is the wrong status shown in /proc/sys/lnet/peers, I think the solution should be to simply show "NA" there when peer health is not enabled. While your patch would produce the proper status in the case of reconnects, there could be other cases where incorrect status would still show up in /proc/sys/lnet/peers, because without peer health enabled the aliveness timestamps aren't refreshed at all. |
| Comment by Isaac Huang (Inactive) [ 19/Mar/13 ] |
|
Patch posted at http://review.whamcloud.com/#change,5770 |
| Comment by Jeremy Filizetti [ 22/Mar/13 ] |
|
Yes, after the peer goes down it remains down for non-routers with the patch from applied. Instead of just reverting |
| Comment by Isaac Huang (Inactive) [ 22/Mar/13 ] |
|
1. By reverting the

2. "Returning 1 from lnet_peer_alive_locked if (the_lnet.ln_routing == 0)" gives false information. "NA" is correct because LNet does not know. If you want to know who is connected, "lctl conn_list" is a better way: it shows active peers at the LND level. /proc/sys/lnet/peers shows peer state at the LNet level, and LNet-level peer state is persistent, in that even if the LND has closed its connection with a peer (e.g. the peer is dead), there'd still be an entry for the peer in /proc/sys/lnet/peers. |
| Comment by Jeremy Filizetti [ 22/Mar/13 ] |
|
Thanks for the clarification on the LNet state. conn_list looks like it will suffice for status information, although it is certainly not as nice a layout, nor as simple a way to get a list of all peers. |
| Comment by Isaac Huang (Inactive) [ 05/Apr/13 ] |
|
New patch at http://review.whamcloud.com/#change,5955 |
| Comment by Peter Jones [ 13/Apr/13 ] |
|
Landed for 2.4 |