[LU-12293] Memory leak after router checker packet processing Created: 13/May/19  Updated: 30/Aug/19

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.1
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Tatsushi Takamura Assignee: Amir Shehata (Inactive)
Resolution: Unresolved Votes: 0
Labels: None

Epic/Theme: lnet
Severity: 3

 Description   

If the monitor thread is stopped (lnet_monitor_thr_stop()) while a router checker packet is waiting for retry,
the resources for that packet are not released.

As a workaround, we changed the code to wait for router checker shutdown to complete (the timeout is 10 sec x 2) and then purge the packets queued for retry.
Could you advise on how this bug should be fixed?

diff --git a/lnet/lnet/lib-move.c b/lnet/lnet/lib-move.c
index 5e990d9..3b16d89 100644
--- a/lnet/lnet/lib-move.c
+++ b/lnet/lnet/lib-move.c
@@ -3682,6 +3682,14 @@ void lnet_monitor_thr_stop(void)
        /* tell the monitor thread that we're shutting down */
        wake_up(&the_lnet.ln_mt_waitq);
 
+       /* wait tx completion for router checker */
+       if (atomic_read(&the_lnet.ln_routers_nsends)) {
+               set_current_state(TASK_UNINTERRUPTIBLE);
+               schedule_timeout(cfs_time_seconds(lnet_get_lnd_timeout() * 2));
+       }
+       /* purge resend messages */
+       lnet_clean_resendqs();
+
        /* block until monitor thread signals that it's done */
        down(&the_lnet.ln_mt_signal);
        LASSERT(the_lnet.ln_mt_state == LNET_MT_STATE_SHUTDOWN);
@@ -3691,7 +3699,6 @@ void lnet_monitor_thr_stop(void)
        lnet_rsp_tracker_clean();
        lnet_clean_local_ni_recoveryq();
        lnet_clean_peer_ni_recoveryq();
-       lnet_clean_resendqs();
        rc = LNetEQFree(the_lnet.ln_mt_eqh);
        LASSERT(rc == 0);
        return;
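
For discussion, the ordering used by the workaround (a bounded wait for in-flight sends, then a purge of the retry queue) can be sketched in userspace C. This is only an illustration under stated assumptions: every name below (monitor_state, queue_resend, complete_one_send, purge_resendq, stop_monitor) is hypothetical and not part of LNet, and unlike the patch, which sleeps once for the full interval, this sketch polls and stops waiting as soon as the send counter reaches zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a message parked on a resend queue. */
struct resend_msg {
	struct resend_msg *next;
};

/* Hypothetical stand-in for the monitor thread state. */
struct monitor_state {
	int nsends;                 /* router checker sends still in flight */
	struct resend_msg *resendq; /* messages waiting for retry */
};

/* Park one heap-allocated message on the retry queue. */
static bool queue_resend(struct monitor_state *st)
{
	struct resend_msg *msg = malloc(sizeof(*msg));

	if (msg == NULL)
		return false;
	msg->next = st->resendq;
	st->resendq = msg;
	return true;
}

/* Simulated time step: one in-flight send completes per tick. */
static void complete_one_send(struct monitor_state *st)
{
	if (st->nsends > 0)
		st->nsends--;
}

/* Free everything still queued for retry; return how many were freed. */
static int purge_resendq(struct monitor_state *st)
{
	int freed = 0;

	while (st->resendq != NULL) {
		struct resend_msg *msg = st->resendq;

		st->resendq = msg->next;
		free(msg);
		freed++;
	}
	return freed;
}

/*
 * Wait at most max_ticks for in-flight sends to drain, then purge the
 * retry queue unconditionally so that nothing queued outlives shutdown.
 */
static int stop_monitor(struct monitor_state *st,
			void (*tick)(struct monitor_state *), int max_ticks)
{
	int waited;

	assert(st != NULL && tick != NULL);
	for (waited = 0; st->nsends > 0 && waited < max_ticks; waited++)
		tick(st);
	return purge_resendq(st);
}
```

In this sketch the purge runs whether or not the wait timed out, so the queued messages are always freed; a send that is still in flight after the timeout remains counted in nsends and would need separate handling.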



 Comments   
Comment by Amir Shehata (Inactive) [ 16/May/19 ]

Please take a look at the patches below. They are all part of the multi-rail branch.

https://review.whamcloud.com/#/c/34445/4
https://review.whamcloud.com/#/c/34477/5 <-- would this one fix the issue in this ticket?
https://review.whamcloud.com/#/c/34252/7
https://review.whamcloud.com/#/c/34607/3
https://review.whamcloud.com/#/c/34770/2
https://review.whamcloud.com/#/c/34771/2
https://review.whamcloud.com/#/c/34778/2
https://review.whamcloud.com/#/c/34796/2
https://review.whamcloud.com/#/c/34798/3
https://review.whamcloud.com/#/c/34885/1


 

Comment by Tatsushi Takamura [ 30/Aug/19 ]

Amir Shehata,

 

Sorry for the late reply. We are going to check these patches.
