[LU-5151] Oops in lnet_return_rx_credits_locked Created: 05/Jun/14  Updated: 11/Jun/14  Resolved: 11/Jun/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: Lustre 2.6.0

Type: Bug Priority: Blocker
Reporter: James A Simmons Assignee: Liang Zhen (Inactive)
Resolution: Fixed Votes: 0
Labels: lnet
Environment:

Cray router to connect infiniband to gemini interconnect.


Severity: 3
Epic: lnet
Rank (Obsolete): 14214

 Description   

While testing 2.6 in my Cray test environment I keep losing my routers; an NMI produces the following backtraces:

2014-06-05T16:45:09.828951-04:00 c0-0c0s2n3 Pid: 4554, comm: kiblnd_sd_01_01 Tainted: P N 3.0.82-0.7.9_1.0502.7780-cray_gem_s #1
2014-06-05T16:45:09.828965-04:00 c0-0c0s2n3 RIP: 0010:[<ffffffffa0341831>] [<ffffffffa0341831>] lnet_return_rx_credits_locked+0x171/0x310 [lnet]
2014-06-05T16:45:09.828971-04:00 c0-0c0s2n3 RSP: 0018:ffff8803ea379bb0 EFLAGS: 00010286
2014-06-05T16:45:09.858936-04:00 c0-0c0s2n3 RAX: dead000000200200 RBX: ffff880317d5a800 RCX: 00000000ffffffff
2014-06-05T16:45:09.858949-04:00 c0-0c0s2n3 RDX: dead000000100100 RSI: 0000000000000001 RDI: ffff880317d5a800
2014-06-05T16:45:09.858960-04:00 c0-0c0s2n3 RBP: ffff8803ea379be0 R08: ffff8803e821c860 R09: ffff880317d5a850
2014-06-05T16:45:09.858970-04:00 c0-0c0s2n3 R10: 0000000000000000 R11: 0000000000000000 R12: ffff880317d5a800
2014-06-05T16:45:09.858977-04:00 c0-0c0s2n3 R13: ffff8803daf91880 R14: 00000000fffffff5 R15: 0000000000000001
2014-06-05T16:45:09.888794-04:00 c0-0c0s2n3 FS: 00007f28c44457a0(0000) GS:ffff880407cc0000(0000) knlGS:0000000000000000
2014-06-05T16:45:09.888807-04:00 c0-0c0s2n3 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
2014-06-05T16:45:09.888818-04:00 c0-0c0s2n3 CR2: 000000000063c800 CR3: 000000031f33f000 CR4: 00000000000007e0
2014-06-05T16:45:09.888824-04:00 c0-0c0s2n3 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2014-06-05T16:45:09.888834-04:00 c0-0c0s2n3 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
2014-06-05T16:45:09.918910-04:00 c0-0c0s2n3 Process kiblnd_sd_01_01 (pid: 4554, threadinfo ffff8803ea378000, task ffff8803e89480c0)
2014-06-05T16:45:09.918924-04:00 c0-0c0s2n3 Stack:
2014-06-05T16:45:09.918940-04:00 c0-0c0s2n3 ffff8803ea379bd0 ffff880317d5a800 0000000000000001 0000000000000001
2014-06-05T16:45:09.918951-04:00 c0-0c0s2n3 00000000fffffff5 0000000000000001 ffff8803ea379c10 ffffffffa0338b28
2014-06-05T16:45:09.918956-04:00 c0-0c0s2n3 ffff880317d5a918 dead000000200200 ffff880317d5a800 ffff8803e9b18d80
2014-06-05T16:45:09.918961-04:00 c0-0c0s2n3 Call Trace:
2014-06-05T16:45:09.918966-04:00 c0-0c0s2n3 [<ffffffffa0338b28>] lnet_msg_decommit+0xf8/0x6b0 [lnet]
2014-06-05T16:45:09.948770-04:00 c0-0c0s2n3 [<ffffffffa0339b47>] lnet_finalize+0x297/0x7d0 [lnet]
2014-06-05T16:45:09.948783-04:00 c0-0c0s2n3 [<ffffffffa03465ed>] lnet_parse+0xc2d/0x1b80 [lnet]
2014-06-05T16:45:09.948794-04:00 c0-0c0s2n3 [<ffffffffa03db68a>] kiblnd_handle_rx+0x30a/0x690 [ko2iblnd]
2014-06-05T16:45:09.948805-04:00 c0-0c0s2n3 [<ffffffffa03e03af>] kiblnd_rx_complete+0x34f/0x420 [ko2iblnd]
2014-06-05T16:45:09.948815-04:00 c0-0c0s2n3 [<ffffffffa03e0d25>] kiblnd_scheduler+0x7c5/0x970 [ko2iblnd]
2014-06-05T16:45:09.948821-04:00 c0-0c0s2n3 [<ffffffff810672fe>] kthread+0x9e/0xb0
2014-06-05T16:45:09.978765-04:00 c0-0c0s2n3 [<ffffffff81481874>] kernel_thread_helper+0x4/0x10
2014-06-05T16:45:09.978785-04:00 c0-0c0s2n3 Code: c2 0f 85 2b 01 00 00 8d 41 01 85 c0 41 89 45 48 0f 8f dc fe ff ff 49 8b 7d 20 be 01 00 00 00 48 83 ef 10 48 8b 47 18 48 8b 57 10
2014-06-05T16:45:10.004304-04:00 c0-0c0s2n3 89 42 08 48 89 10 48 b8 00 01 10 00 00 00 ad de 48 89 47 10
2014-06-05T16:45:10.004326-04:00 c0-0c0s2n3 RIP [<ffffffffa0341831>] lnet_return_rx_credits_locked+0x171/0x310 [lnet]
2014-06-05T16:45:10.004333-04:00 c0-0c0s2n3 RSP <ffff8803ea379bb0>
2014-06-05T16:45:10.029888-04:00 c0-0c0s2n3 --[ end trace 17126666cf42dece ]--



 Comments   
Comment by James Nunez (Inactive) [ 05/Jun/14 ]

Liang,

Would you please comment on this ticket?

Thank you,
James

Comment by Liang Zhen (Inactive) [ 06/Jun/14 ]

Hi James, to narrow down the problem, have you ever seen this issue with other versions between 2.4 and 2.6? thanks

Comment by Liang Zhen (Inactive) [ 06/Jun/14 ]

I see the reason here. I think it's because this patch changed the return value of lnet_post_routed_recv_locked() from positive to negative (http://review.whamcloud.com/#/c/9369/),
which means lnet_parse_forward_locked()->lnet_post_routed_recv_locked() will return -EAGAIN instead of EAGAIN, and it will be treated as a real error in lnet_parse. It is not a real error, because EAGAIN means the message is waiting for a router buffer, so we end up finalizing a message which is still queued:

        if (!for_me) {
                rc = lnet_parse_forward_locked(ni, msg);
                lnet_net_unlock(cpt);

                if (rc < 0)
                        goto free_drop;
                if (rc == 0) {
                        lnet_ni_recv(ni, msg->msg_private, msg, 0,
                                     0, payload_length, payload_length);
                }
                return 0;
        }
Comment by James A Simmons [ 06/Jun/14 ]

I have only seen the problem with 2.6.

Comment by Liang Zhen (Inactive) [ 06/Jun/14 ]

patch is here: http://review.whamcloud.com/#/c/10625/

Comment by James A Simmons [ 06/Jun/14 ]

I changed it from EAGAIN to -EAGAIN so it matches the behavior of the upstream kernel, where using positive EAGAIN in kernel space is frowned upon. How about instead we just do:

diff --git a/lnet/lnet/lib-move.c b/lnet/lnet/lib-move.c
index 6097ae0..5fcc19b 100644
--- a/lnet/lnet/lib-move.c
+++ b/lnet/lnet/lib-move.c
@@ -1961,12 +1961,14 @@ lnet_parse(lnet_ni_t *ni, lnet_hdr_t *hdr, lnet_nid_t from_nid,
 		rc = lnet_parse_forward_locked(ni, msg);
 		lnet_net_unlock(cpt);
 
-		if (rc < 0)
+		if (rc == -EAGAIN) /* waiting for buffer */
+			return 0;
+
+		if (rc != 0)
 			goto free_drop;
-		if (rc == 0) {
-			lnet_ni_recv(ni, msg->msg_private, msg, 0,
-				     0, payload_length, payload_length);
-		}
+
+		lnet_ni_recv(ni, msg->msg_private, msg, 0,
+			     0, payload_length, payload_length);
 		return 0;
 	}

Comment by James A Simmons [ 09/Jun/14 ]

I tried the above version of my patch and it resolved the issue. Liang, do you mind if we go with that version instead?

Comment by Liang Zhen (Inactive) [ 09/Jun/14 ]

Hi James, both ways should be fine. My concern about -EAGAIN is that we would need to impose another condition on LNDs: -EAGAIN must never be returned by lnet_ni_eager_recv.
That is true for now, but it's probably better not to have this limit?

Comment by James A Simmons [ 09/Jun/14 ]

I wouldn't consider that a huge limitation. The LASSERT in lnet_ni_eager_recv would handle this case today, so a potential driver writer would know not to return -EAGAIN.

Comment by James A Simmons [ 09/Jun/14 ]

I see there are strong opinions on this. I propose that if you want to keep positive return values in some of these functions, we call them something other than EAGAIN and ENOENT. Using those values will be flagged by the HPPD checker and will be frowned on upstream. So I suggest you define your own errors, e.g. EAGAIN becomes LNET_RETRY and ENOENT becomes LNET_MISMATCH. Can you live with this compromise?

Comment by Liang Zhen (Inactive) [ 10/Jun/14 ]

Sorry, I didn't notice there were updates and review comments on the patch, and I overwrote it.
After rethinking, I agree with Isaac and still tend to keep the positive value; it is much cleaner to me.

Comment by Cory Spitz [ 10/Jun/14 ]

James, thanks for reporting this problem. We have seen this at Cray as well, of course, and only on routers as you had. We have been working around the problem by using b2_5 vintage routers.

Comment by James A Simmons [ 10/Jun/14 ]

Thank you Liang for working with me on this issue. IMHO the new LNET_CREDIT_* values make it far more clear what is going on than using EAGAIN or just returning zero.

Comment by Liang Zhen (Inactive) [ 11/Jun/14 ]

You're welcome; it's assigned to me anyway. I think we can close it now, because the patch has landed on master, which is the only branch with this issue.

Comment by Jodi Levi (Inactive) [ 11/Jun/14 ]

Patch landed to Master.
