Details
Type: Bug
Resolution: Unresolved
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.12.1
Labels: None
Severity: 3
Description
If net_monitor_thr is stopped while a router checker packet is waiting for retry,
the resources for that packet are not released.
As a workaround, we changed lnet_monitor_thr_stop() to wait for the router checker shutdown to complete (the timeout is 10 sec x 2) and then purge the retry packets.
Could you discuss how to fix this bug?
diff --git a/lnet/lnet/lib-move.c b/lnet/lnet/lib-move.c
index 5e990d9..3b16d89 100644
--- a/lnet/lnet/lib-move.c
+++ b/lnet/lnet/lib-move.c
@@ -3682,6 +3682,14 @@ void lnet_monitor_thr_stop(void)
 	/* tell the monitor thread that we're shutting down */
 	wake_up(&the_lnet.ln_mt_waitq);
 
+	/* wait tx completion for router checker */
+	if (atomic_read(&the_lnet.ln_routers_nsends)) {
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		schedule_timeout(cfs_time_seconds(lnet_get_lnd_timeout() * 2));
+	}
+	/* purge resend messages */
+	lnet_clean_resendqs();
+
 	/* block until monitor thread signals that it's done */
 	down(&the_lnet.ln_mt_signal);
 	LASSERT(the_lnet.ln_mt_state == LNET_MT_STATE_SHUTDOWN);
@@ -3691,7 +3699,6 @@ void lnet_monitor_thr_stop(void)
 	lnet_rsp_tracker_clean();
 	lnet_clean_local_ni_recoveryq();
 	lnet_clean_peer_ni_recoveryq();
-	lnet_clean_resendqs();
 	rc = LNetEQFree(the_lnet.ln_mt_eqh);
 	LASSERT(rc == 0);
 	return;
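
To make the intended ordering easier to discuss, here is a minimal userspace sketch of the workaround: if router checker sends are still in flight, wait out the timeout first, then purge the resend queue. This is not Lustre code. Every name in it (struct monitor_model, model_queue_resend, model_clean_resendq, model_tick_send_fails, model_monitor_stop) is a hypothetical stand-in, and the "tick" callback is only an assumed way to model time passing while pending sends fail and get parked for retry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a message parked on a resend queue. */
struct resend_msg {
	struct resend_msg *next;
};

/* Hypothetical stand-in for the relevant state in the_lnet. */
struct monitor_model {
	int nsends;                 /* router checker sends still in flight */
	struct resend_msg *resendq; /* messages waiting for retry */
};

/* Park one message on the resend queue; false on allocation failure. */
static bool model_queue_resend(struct monitor_model *m)
{
	struct resend_msg *msg = malloc(sizeof(*msg));

	if (msg == NULL)
		return false;
	msg->next = m->resendq;
	m->resendq = msg;
	return true;
}

/* Free every queued message and return how many were released. */
static size_t model_clean_resendq(struct monitor_model *m)
{
	size_t freed = 0;

	while (m->resendq != NULL) {
		struct resend_msg *msg = m->resendq;

		m->resendq = msg->next;
		free(msg);
		freed++;
	}
	return freed;
}

/* Assumed failure mode: one in-flight send fails per tick and is re-queued. */
static void model_tick_send_fails(struct monitor_model *m)
{
	if (m->nsends > 0) {
		m->nsends--;
		(void)model_queue_resend(m);
	}
}

/*
 * Ordering used by the workaround: if any sends are in flight, wait for the
 * whole timeout (modelled as timeout_ticks calls to tick), and only then
 * purge the resend queue, so messages queued during the wait are freed too.
 */
static size_t model_monitor_stop(struct monitor_model *m, int timeout_ticks,
				 void (*tick)(struct monitor_model *))
{
	if (m->nsends > 0) {
		for (int i = 0; i < timeout_ticks; i++)
			tick(m);
	}
	return model_clean_resendq(m);
}
```

In this model, a purge that ran before the wait would miss any message queued while the in-flight sends were timing out, which is the leak described above. Like the patch, the sketch waits for the full timeout whenever sends are pending rather than returning as soon as the count reaches zero.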