Details
Bug - Resolution: Unresolved - Minor - None - Lustre 2.12.1 - None - 3
Description
If the LNet monitor thread is stopped (lnet_monitor_thr_stop()) while a router checker packet is waiting for retry,
the resources held for that packet are not released.
As a workaround, we changed the stop path to wait for the router checker shutdown to complete (the timeout is 10 sec x 2) and, after that, to purge the packets still queued for retry.
Could you discuss how this bug should be fixed?
diff --git a/lnet/lnet/lib-move.c b/lnet/lnet/lib-move.c
index 5e990d9..3b16d89 100644
--- a/lnet/lnet/lib-move.c
+++ b/lnet/lnet/lib-move.c
@@ -3682,6 +3682,14 @@ void lnet_monitor_thr_stop(void)
 	/* tell the monitor thread that we're shutting down */
 	wake_up(&the_lnet.ln_mt_waitq);
 
+	/* wait tx completion for router checker */
+	if (atomic_read(&the_lnet.ln_routers_nsends)) {
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		schedule_timeout(cfs_time_seconds(lnet_get_lnd_timeout() * 2));
+	}
+	/* purge resend messages */
+	lnet_clean_resendqs();
+
 	/* block until monitor thread signals that it's done */
 	down(&the_lnet.ln_mt_signal);
 	LASSERT(the_lnet.ln_mt_state == LNET_MT_STATE_SHUTDOWN);
@@ -3691,7 +3699,6 @@ void lnet_monitor_thr_stop(void)
 	lnet_rsp_tracker_clean();
 	lnet_clean_local_ni_recoveryq();
 	lnet_clean_peer_ni_recoveryq();
-	lnet_clean_resendqs();
 	rc = LNetEQFree(the_lnet.ln_mt_eqh);
 	LASSERT(rc == 0);
 	return;
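
The ordering used by the workaround above (wait, bounded by a timeout, for outstanding router checker sends, then purge whatever is still queued for retry) can be modeled in user space. The following is only a minimal sketch under stated assumptions, not LNet code: `resend_msg`, `routers_nsends`, `resendq`, `queue_resend()`, `wait_for_sends()`, `purge_resendq()`, `monitor_stop()` and `idle_tick()` are all hypothetical stand-ins, and the timeout is modeled as a number of polling ticks rather than as `schedule_timeout()`. Unlike the patch, which sleeps for the whole timeout whenever the counter is non-zero, the sketch stops waiting as soon as the counter reaches zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a message parked on a resend queue. */
struct resend_msg {
	struct resend_msg *next;
};

/* Hypothetical stand-ins for the_lnet.ln_routers_nsends and the resend queue. */
static atomic_int routers_nsends;
static struct resend_msg *resendq;

/* Park one message for retry and count it as an outstanding send. */
static int queue_resend(void)
{
	struct resend_msg *msg = malloc(sizeof(*msg));

	if (msg == NULL)
		return -1;
	msg->next = resendq;
	resendq = msg;
	atomic_fetch_add(&routers_nsends, 1);
	return 0;
}

/* A tick during which no send completes (worst case for the wait). */
static void idle_tick(void)
{
}

/*
 * Poll until no sends are outstanding or max_ticks ticks have elapsed.
 * Returns the number of sends still outstanding when the wait ends.
 */
static int wait_for_sends(int max_ticks, void (*tick)(void))
{
	for (int i = 0; i < max_ticks && atomic_load(&routers_nsends) > 0; i++)
		tick();
	return atomic_load(&routers_nsends);
}

/* Free every message still on the resend queue; returns how many were freed. */
static int purge_resendq(void)
{
	int freed = 0;

	while (resendq != NULL) {
		struct resend_msg *msg = resendq;

		resendq = msg->next;
		free(msg);
		freed++;
	}
	return freed;
}

/* Stop path: bounded wait first, then purge, so nothing queued is leaked. */
static int monitor_stop(int max_ticks, void (*tick)(void))
{
	int freed;

	wait_for_sends(max_ticks, tick);
	freed = purge_resendq();
	assert(resendq == NULL);
	return freed;
}
```

Calling `monitor_stop(20, idle_tick)` after three `queue_resend()` calls waits out all 20 ticks, because no send ever completes in this model, and then frees the three queued messages. Whether a bounded wait in the stop path is acceptable in the real code, or whether the purge should instead happen once the monitor thread has signalled completion, is the open question raised in the description.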