[LU-12441] Response tracker is not detached on router ping reply Created: 15/Jun/19 Updated: 12/Dec/19 Resolved: 09/Aug/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0, Lustre 2.12.2 |
| Fix Version/s: | Lustre 2.13.0, Lustre 2.12.4 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Chris Horn | Assignee: | Chris Horn |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
The response tracker is not detached when a router ping reply arrives, which leads to some false-positive "timeouts". In the excerpt below, the response tracker is attached to the md of the ping msg and the reply arrives, but the md is never unlinked, so the response tracker is never detached. The reply is parsed less than a millisecond after the LNetGet, yet the tracker is reported as timed out at 1560633977, after its deadline (dl 1560633969s) has passed.

00000400:00000200:26.0:1560633919.659270:0:2482:0:(router.c:1045:lnet_ping_router_locked()) Check: 12345-93@gni4
00000400:00000200:26.0:1560633919.659290:0:2482:0:(lib-move.c:4844:LNetGet()) LNetGet msg ffff8807832cdc00 -> 12345-93@gni4
00000400:00000200:26.0:1560633919.659291:0:2482:0:(lib-msg.c:364:lnet_msg_attach_md()) attached md ffff88078a861f68 to msg ffff8807832cdc00
00000400:00000200:26.0:1560633919.659293:0:2482:0:(lib-move.c:4505:lnet_attach_rsp_tracker()) Add rspt ffff88078b5ef000 to md ffff88078a861f68 dl 1560633969s ne false
00000400:00000200:6.0:1560633919.659345:0:2479:0:(lib-msg.c:775:lnet_msg_detach_md()) ffff88078a861f68 ref 0 fl 2 thr -1 opt 10 off 0 size 0 len 272 msg ffff8807832cdc00 unlink false
00000400:00000200:6.0:1560633919.659466:0:2479:0:(lib-move.c:3890:lnet_parse_reply()) 60@gni4: Reply msg ffff88078c803800 from 12345-93@gni4 of length 80/80 into md 0x65931
00000400:00000200:6.0:1560633919.659467:0:2479:0:(lib-msg.c:364:lnet_msg_attach_md()) attached md ffff88078a861f68 to msg ffff88078c803800
00000400:00000200:6.0:1560633919.659474:0:2479:0:(router.c:120:lnet_notify_locked()) Old news
00000400:00000200:6.0:1560633919.659475:0:2479:0:(lib-msg.c:775:lnet_msg_detach_md()) ffff88078a861f68 ref 0 fl 2 thr -1 opt 10 off 0 size 0 len 272 msg ffff88078c803800 unlink false
00000400:00000200:6.0:1560633919.659507:0:2479:0:(router.c:120:lnet_notify_locked()) Old news
00000400:00000200:6.0:1560633919.659578:0:2479:0:(router.c:120:lnet_notify_locked()) Old news
00000400:00000200:6.0:1560633919.659619:0:2479:0:(router.c:120:lnet_notify_locked()) Old news
00000400:00000100:26.0:1560633977.003290:0:2482:0:(lib-move.c:2781:lnet_finalize_expired_responses()) Response timed out: md = ffff88078a861f68: nid = 93@gni4 |
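For anyone correlating these lines by hand, the sketch below splits one debug log line into its fields. It assumes the usual colon-separated layout seen in the excerpt (subsystem, debug mask, CPU, seconds.microseconds, stack size, PID, one more numeric field, then the source location and the message). The names `struct dbg_line`, `parse_dbg_line()` and `dbg_line_is_func()` are hypothetical helpers for illustration and are not part of Lustre; the value after the dot in the CPU field is skipped.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the fields of one debug log line. */
struct dbg_line {
    unsigned int subsys;   /* e.g. 0x400 in the excerpt */
    unsigned int mask;     /* debug mask */
    int cpu;               /* CPU id; the value after the dot is skipped */
    long sec;              /* timestamp, seconds */
    long usec;             /* timestamp, microseconds */
    int pid;
    char file[64];         /* source file */
    int line;              /* source line */
    char func[64];         /* function name without "()" */
    const char *msg;       /* points into the input string */
};

/* Returns 0 on success, -1 if the line does not match the assumed layout. */
int parse_dbg_line(const char *s, struct dbg_line *out)
{
    int consumed = -1;
    int n;

    assert(s != NULL && out != NULL);
    n = sscanf(s,
               "%x:%x:%d.%*d:%ld.%ld:%*d:%d:%*d:(%63[^:]:%d:%63[^(]()) %n",
               &out->subsys, &out->mask, &out->cpu, &out->sec, &out->usec,
               &out->pid, out->file, &out->line, out->func, &consumed);
    /* %n is only stored if everything before it matched. */
    if (n != 9 || consumed < 0)
        return -1;
    out->msg = s + consumed;
    return 0;
}

/* True if the line was emitted by the named function. */
int dbg_line_is_func(const struct dbg_line *l, const char *func)
{
    return strcmp(l->func, func) == 0;
}
```

With this, the lines from lnet_attach_rsp_tracker() and lnet_finalize_expired_responses() can be picked out by function name and matched on the md address in the message.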
| Comments |
| Comment by Chris Horn [ 15/Jun/19 ] |
|
I think the solution here is to detach the response tracker in lnet_router_checker_event() for the reply case. |
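A minimal sketch of that idea, using a simplified stand-alone model rather than the real LNet code. All types and functions here (`struct mem_desc`, `struct rsp_tracker`, `attach_rsp_tracker()`, `detach_rsp_tracker()`, `response_expired()`, `router_checker_event()`) are hypothetical stand-ins, and the `detach_on_reply` flag exists only to contrast the behavior before and after the change. The actual patch may be structured differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the LNet structures. */
struct rsp_tracker {
    long deadline;              /* absolute time in seconds */
};

struct mem_desc {
    struct rsp_tracker *rspt;   /* NULL when no tracker is attached */
};

enum event_type { EV_SEND, EV_REPLY };

/* Attach a tracker that expires `timeout` seconds after `now`. */
void attach_rsp_tracker(struct mem_desc *md, struct rsp_tracker *rspt,
                        long now, long timeout)
{
    assert(md != NULL && rspt != NULL);
    rspt->deadline = now + timeout;
    md->rspt = rspt;
}

void detach_rsp_tracker(struct mem_desc *md)
{
    assert(md != NULL);
    md->rspt = NULL;
}

/* Models the expiry scan: an md with a tracker past its deadline
 * is reported as "Response timed out". */
bool response_expired(const struct mem_desc *md, long now)
{
    return md->rspt != NULL && now >= md->rspt->deadline;
}

/* Models the router checker event handler. With detach_on_reply set,
 * a reply detaches the tracker even though the md stays linked. */
void router_checker_event(struct mem_desc *md, enum event_type type,
                          bool detach_on_reply)
{
    if (type == EV_REPLY && detach_on_reply)
        detach_rsp_tracker(md);
    /* The md is reused for later pings, so it is not unlinked here. */
}
```

Using the timestamps from the description: a tracker attached at 1560633919 with a 50 second timeout gets deadline 1560633969. Without the detach, a check at 1560633977 reports it as expired even though the reply was handled; with the detach, the same check finds no tracker on the md.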
| Comment by Gerrit Updater [ 16/Jun/19 ] |
|
Edit - Removed reference to abandoned patch |
| Comment by Gerrit Updater [ 09/Jul/19 ] |
|
Chris Horn (hornc@cray.com) uploaded a new patch: https://review.whamcloud.com/35452 |
| Comment by Gerrit Updater [ 09/Aug/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35452/ |
| Comment by Peter Jones [ 09/Aug/19 ] |
|
Landed for 2.13 |
| Comment by Gerrit Updater [ 31/Oct/19 ] |
|
Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36634 |
| Comment by Gerrit Updater [ 05/Dec/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36634/ |