[LU-2133] Peers always report down for non-routers after LND disconnects Created: 09/Oct/12  Updated: 13/Apr/13  Resolved: 13/Apr/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0, Lustre 2.1.2
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Minor
Reporter: Jeremy Filizetti Assignee: Isaac Huang (Inactive)
Resolution: Fixed Votes: 0
Labels: ptr
Environment:

Originated from LU-630 patch


Severity: 3
Rank (Obsolete): 5135

 Comments   
Comment by Jeremy Filizetti [ 09/Oct/12 ]

After adding the patch from LU-630, peers always report down after the connection is disconnected, even after it is reconnected. Even though peer health wasn't meant as a generic health mechanism, as mentioned in LU-630, incorrectly reporting that a peer is down even after it has just reconnected seems counter-productive. Is there any reason that LNDs can't just call lnet_notify() to report that the link is alive after the connection is established, similar to the following patch:

diff --git a/lnet/klnds/o2iblnd/o2iblnd_cb.c b/lnet/klnds/o2iblnd/o2iblnd_cb.c
index eb567a5..d620cf3 100644
--- a/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -2060,6 +2060,7 @@ kiblnd_connreq_done(kib_conn_t *conn, int status)
         }
 
         write_unlock_irqrestore(&kiblnd_data.kib_global_lock, flags);
+        lnet_notify(peer->ibp_ni, peer->ibp_nid, 1, cfs_time_current());
 
         /* Schedule blocked txs */
         spin_lock (&conn->ibc_lock);
diff --git a/lnet/klnds/socklnd/socklnd_cb.c b/lnet/klnds/socklnd/socklnd_cb.c
index 56e5ad3..e84a37b 100644
--- a/lnet/klnds/socklnd/socklnd_cb.c
+++ b/lnet/klnds/socklnd/socklnd_cb.c
@@ -1987,9 +1987,12 @@ ksocknal_connect (ksock_route_t *route)
                 }
 
                 ksocknal_launch_connection_locked(route);
+                cfs_write_unlock_bh (&ksocknal_data.ksnd_global_lock);
+        } else {
+                cfs_write_unlock_bh (&ksocknal_data.ksnd_global_lock);
+                lnet_notify(peer->ksnp_ni, peer->ksnp_id.nid, 1, cfs_time_current());
         }
 
-        cfs_write_unlock_bh (&ksocknal_data.ksnd_global_lock);
         return retry_later;
 
  failed:

Unfortunately I only have ksocklnd and ko2iblnd clients, so I don't know about other LNDs and can't test them anyway.

Comment by Peter Jones [ 09/Oct/12 ]

Isaac

Could you please comment on this one?

Thanks

Peter

Comment by Isaac Huang (Inactive) [ 19/Mar/13 ]

Hi, did you mean that peers were reported as "down" in /proc/sys/lnet/peers?

The reason why lnet_notify() isn't called from LNDs for "up" is that it was a design choice for LNet to poll LNDs when necessary instead of having LNDs interrupt LNet - e.g. when peer health isn't enabled at the LNet layer there aren't any polling calls into LNDs at all, and there are other reasons why polling from the upper layer is more efficient than interrupting from the lower layer.
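
For reference, the poll-on-demand path can be sketched as follows - a minimal sketch assuming the LNet internals of this era (the lnd_query() hook in lnd_t and the lp_nid/lp_last_alive fields of lnet_peer_t), not the exact code in lnet/lnet/lib-move.c:

static void
lnet_ni_query_locked(lnet_ni_t *ni, lnet_peer_t *lp)
{
        cfs_time_t last_alive = 0;

        /* LNet pulls aliveness from the LND only when it needs it;
         * the LND never pushes "up" notifications of its own accord. */
        LASSERT(ni->ni_lnd->lnd_query != NULL);

        (ni->ni_lnd->lnd_query)(ni, lp->lp_nid, &last_alive);

        /* Refresh the timestamp only if the LND had information. */
        if (last_alive != 0)
                lp->lp_last_alive = last_alive;
}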

If your concern was the wrong status shown in /proc/sys/lnet/peers, I think the solution should be to simply show "NA" in /proc/sys/lnet/peers when peer health is not enabled - while your patch would produce the proper status in the case of reconnects, there could be other cases where an incorrect status would still show up in /proc/sys/lnet/peers, because without peer health enabled the aliveness timestamps aren't refreshed at all.
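
A minimal C sketch of that idea, assuming the lnet_peer_aliveness_enabled() check and lp_alive field from the LNet code of this era (the helper name peer_aliveness_str() is hypothetical; in the real code the check would live in the /proc handler itself):

static const char *
peer_aliveness_str(lnet_peer_t *lp)
{
        /* Without peer health enabled, lp_alive is never refreshed,
         * so "up"/"down" would be stale; report "NA" instead. */
        if (!lnet_peer_aliveness_enabled(lp))
                return "NA";

        return lp->lp_alive ? "up" : "down";
}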

Comment by Isaac Huang (Inactive) [ 19/Mar/13 ]

Patch posted at http://review.whamcloud.com/#change,5770

Comment by Jeremy Filizetti [ 22/Mar/13 ]

Yes, after the peer goes down it remains down for non-routers after the patch from LU-630. The patch I had here was incomplete, but after digging around for the correct place to actually notify LNet, it doesn't seem there are any really good places, and based on your comments I guess I can see why. The problem with just showing NA is that we have a requirement to know who is connected. I lose that ability with the current LU-630 patch, so it creates one problem while fixing another.

Instead of just reverting LU-630's patch, I think I'd like to see it revised to do something along the lines of returning 1 from lnet_peer_alive_locked() if (the_lnet.ln_routing == 0) instead of returning 0, though not before querying the LND.
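
A minimal C sketch of that suggestion, under the assumption that lnet_peer_alive_locked() keeps its shape from this era (the lnet_ni_query_locked() helper, the_lnet.ln_routing, and the lnet_peer_is_alive() check); this is the proposal under discussion, not the fix that eventually landed:

static int
lnet_peer_alive_locked(lnet_peer_t *lp)
{
        /* Give the LND a chance to refresh lp_last_alive first. */
        lnet_ni_query_locked(lp->lp_ni, lp);

        /* On a non-router, never declare the peer dead from stale
         * state; report it alive and let the send proceed. */
        if (the_lnet.ln_routing == 0)
                return 1;

        /* ... fall through to the existing router-side aliveness
         * checks against lp->lp_last_alive ... */
        return lnet_peer_is_alive(lp, cfs_time_current());
}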

Comment by Isaac Huang (Inactive) [ 22/Mar/13 ]

1. By reverting the LU-630 patch, you'd essentially enable a feature that does not work except on routers. Then you'd have to make sure that it's disabled everywhere else by module options, otherwise clients and servers would see dropped messages when the network and peers are in good health. In other words, I believe there's a very good reason for the LU-630 patch.

2. "returning 1 from lnet_peer_alive_locked if (the_lnet.ln_routing == 0)" gives false information. "NA" is correct because LNet does not know. If you want to know who is connected, "lctl conn_list" is a better way. It shows active peers at the LND level. /proc/sys/lnet/peers shows peer state in LNet level, and LNet level peer state is persistent, in that even if the LND has closed connection with a peer (e.g. peer is dead), there'd still be an entry for the peer in /proc/sys/lnet/peers.

Comment by Jeremy Filizetti [ 22/Mar/13 ]

Thanks for the clarification on the LNet state. conn_list looks like it will suffice for status information, although it's certainly not as nice a layout, nor as simple a way to get a list of all peers.

Comment by Isaac Huang (Inactive) [ 05/Apr/13 ]

New patch at http://review.whamcloud.com/#change,5955

Comment by Peter Jones [ 13/Apr/13 ]

Landed for 2.4
