[LU-12441] Response tracker is not detached on router ping reply Created: 15/Jun/19  Updated: 12/Dec/19  Resolved: 09/Aug/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.12.2
Fix Version/s: Lustre 2.13.0, Lustre 2.12.4

Type: Bug
Priority: Minor
Reporter: Chris Horn
Assignee: Chris Horn
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Related
is related to LU-12568 LNetError: 28086:0:(lib-move.c:2862:l... Resolved
is related to LU-12906 LBUG ASSERTION( rspt->rspt_cpt == cpt... Resolved
is related to LU-12907 LNet routers: LNetError: 14141:0:(lib... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

Description

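When a router ping reply arrives, the response tracker is not detached, which leads to false-positive "Response timed out" errors. In the excerpt below, the response tracker is attached for the ping message and the reply arrives, but the md is never unlinked (note "unlink false"), so the response tracker is never detached and eventually expires.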

00000400:00000200:26.0:1560633919.659270:0:2482:0:(router.c:1045:lnet_ping_router_locked()) Check: 12345-93@gni4
00000400:00000200:26.0:1560633919.659290:0:2482:0:(lib-move.c:4844:LNetGet()) LNetGet msg ffff8807832cdc00 -> 12345-93@gni4
00000400:00000200:26.0:1560633919.659291:0:2482:0:(lib-msg.c:364:lnet_msg_attach_md()) attached md ffff88078a861f68 to msg ffff8807832cdc00
00000400:00000200:26.0:1560633919.659293:0:2482:0:(lib-move.c:4505:lnet_attach_rsp_tracker()) Add rspt ffff88078b5ef000 to md ffff88078a861f68 dl 1560633969s ne false
00000400:00000200:6.0:1560633919.659345:0:2479:0:(lib-msg.c:775:lnet_msg_detach_md()) ffff88078a861f68 ref 0 fl 2 thr -1 opt 10 off 0 size 0 len 272 msg ffff8807832cdc00 unlink false
00000400:00000200:6.0:1560633919.659466:0:2479:0:(lib-move.c:3890:lnet_parse_reply()) 60@gni4: Reply msg ffff88078c803800 from 12345-93@gni4 of length 80/80 into md 0x65931
00000400:00000200:6.0:1560633919.659467:0:2479:0:(lib-msg.c:364:lnet_msg_attach_md()) attached md ffff88078a861f68 to msg ffff88078c803800
00000400:00000200:6.0:1560633919.659474:0:2479:0:(router.c:120:lnet_notify_locked()) Old news
00000400:00000200:6.0:1560633919.659475:0:2479:0:(lib-msg.c:775:lnet_msg_detach_md()) ffff88078a861f68 ref 0 fl 2 thr -1 opt 10 off 0 size 0 len 272 msg ffff88078c803800 unlink false
00000400:00000200:6.0:1560633919.659507:0:2479:0:(router.c:120:lnet_notify_locked()) Old news
00000400:00000200:6.0:1560633919.659578:0:2479:0:(router.c:120:lnet_notify_locked()) Old news
00000400:00000200:6.0:1560633919.659619:0:2479:0:(router.c:120:lnet_notify_locked()) Old news
00000400:00000100:26.0:1560633977.003290:0:2482:0:(lib-move.c:2781:lnet_finalize_expired_responses()) Response timed out: md = ffff88078a861f68: nid = 93@gni4
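
The timing in the excerpt above can be checked with simple arithmetic. The sketch below is not LNet code: it only subtracts timestamps copied from the log lines, converted to microseconds, and the constant and function names are made up for illustration.

```c
#include <stdint.h>

/* Timestamps copied from the debug log above, in microseconds since the epoch. */
#define T_ATTACH_US   1560633919659293LL /* lnet_attach_rsp_tracker(): "Add rspt" */
#define T_REPLY_US    1560633919659466LL /* lnet_parse_reply(): "Reply msg" */
#define T_DEADLINE_US 1560633969000000LL /* "dl 1560633969s" on the attach line */
#define T_EXPIRED_US  1560633977003290LL /* lnet_finalize_expired_responses() */

/* Time between attaching the tracker and parsing the reply. */
int64_t reply_latency_us(void)
{
	return T_REPLY_US - T_ATTACH_US;
}

/* Non-zero if the reply was parsed before the tracker's deadline. */
int reply_arrived_before_deadline(void)
{
	return T_REPLY_US < T_DEADLINE_US;
}

/* How long after the reply the "Response timed out" message was logged. */
int64_t expiry_lag_after_reply_us(void)
{
	return T_EXPIRED_US - T_REPLY_US;
}
```

With these values the reply is parsed 173 microseconds after the tracker is attached, well before the deadline, yet the timeout is still reported about 57.3 seconds later. That is why the timeout is a false positive.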


Comments
Comment by Chris Horn [ 15/Jun/19 ]

I think the solution here is to detach the response tracker in lnet_router_checker_event() for the reply case.
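
As a rough illustration of the idea (and of the subject line of the patch that was eventually merged, "Detach rspt when md_threshold is infinite"), here is a toy model in plain C. It is not the LNet implementation or the actual patch: every type and function name below is hypothetical, and the model assumes that "thr -1" in the log denotes an infinite threshold and that the tracker is otherwise dropped only when the md is unlinked.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an infinite md threshold ("thr -1" in the log). */
#define TOY_MD_THRESH_INF (-1)

struct toy_rspt {
	int64_t deadline_s;        /* response deadline, in seconds */
};

struct toy_md {
	int threshold;             /* remaining operations, or TOY_MD_THRESH_INF */
	struct toy_rspt *rspt;     /* NULL when no response tracker is attached */
};

/* In this model an md is unlinked only once its threshold reaches zero. */
static bool toy_md_unlinks(const struct toy_md *md)
{
	return md->threshold == 0;
}

/* Behaviour described in this ticket: the tracker is dropped only on unlink. */
void toy_handle_reply_before(struct toy_md *md)
{
	if (md->threshold > 0)
		md->threshold--;
	if (toy_md_unlinks(md))
		md->rspt = NULL;
}

/* Sketch of the fix: also drop the tracker when the threshold is infinite. */
void toy_handle_reply_after(struct toy_md *md)
{
	if (md->threshold > 0)
		md->threshold--;
	if (toy_md_unlinks(md) || md->threshold == TOY_MD_THRESH_INF)
		md->rspt = NULL;
}

/* A tracker that is still attached past its deadline is reported as a timeout. */
bool toy_response_timed_out(const struct toy_md *md, int64_t now_s)
{
	return md->rspt != NULL && now_s >= md->rspt->deadline_s;
}
```

In this model, an md with an infinite threshold keeps its tracker after toy_handle_reply_before() and is later flagged by toy_response_timed_out(), while toy_handle_reply_after() detaches the tracker as soon as the reply is handled.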

Comment by Gerrit Updater [ 16/Jun/19 ]

Edit - Removed reference to abandoned patch

Comment by Gerrit Updater [ 09/Jul/19 ]

Chris Horn (hornc@cray.com) uploaded a new patch: https://review.whamcloud.com/35452
Subject: LU-12441 lnet: response tracker cleanup
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5812fe14104d090b65be0c353e9088b079e0ce42

Comment by Gerrit Updater [ 09/Aug/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35452/
Subject: LU-12441 lnet: Detach rspt when md_threshold is infinite
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ebbf909a1c2d0f5400da2d98e1bb274a9e82e0a5

Comment by Peter Jones [ 09/Aug/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 31/Oct/19 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36634
Subject: LU-12441 lnet: Detach rspt when md_threshold is infinite
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: bf8bfe8338ac6d3a5715f66f1f845b9618d270dc

Comment by Gerrit Updater [ 05/Dec/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36634/
Subject: LU-12441 lnet: Detach rspt when md_threshold is infinite
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: c095fbda55ca632cff2696550f22a13a19ee4514
