  Lustre › LU-5151

Oops in lnet_return_rx_credits_locked

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version: Lustre 2.6.0
    • Affects Version: Lustre 2.6.0
    • Environment: Cray router to connect InfiniBand to Gemini interconnect.
    • Severity: 3
    • 14214

    Description

      While testing 2.6 in my Cray test environment I keep losing my routers; NMI'ing a crashed router produces the following back trace:

      2014-06-05T16:45:09.828951-04:00 c0-0c0s2n3 Pid: 4554, comm: kiblnd_sd_01_01 Tainted: P N 3.0.82-0.7.9_1.0502.7780-cray_gem_s #1
      2014-06-05T16:45:09.828965-04:00 c0-0c0s2n3 RIP: 0010:[<ffffffffa0341831>] [<ffffffffa0341831>] lnet_return_rx_credits_locked+0x171/0x310 [lnet]
      2014-06-05T16:45:09.828971-04:00 c0-0c0s2n3 RSP: 0018:ffff8803ea379bb0 EFLAGS: 00010286
      2014-06-05T16:45:09.858936-04:00 c0-0c0s2n3 RAX: dead000000200200 RBX: ffff880317d5a800 RCX: 00000000ffffffff
      2014-06-05T16:45:09.858949-04:00 c0-0c0s2n3 RDX: dead000000100100 RSI: 0000000000000001 RDI: ffff880317d5a800
      2014-06-05T16:45:09.858960-04:00 c0-0c0s2n3 RBP: ffff8803ea379be0 R08: ffff8803e821c860 R09: ffff880317d5a850
      2014-06-05T16:45:09.858970-04:00 c0-0c0s2n3 R10: 0000000000000000 R11: 0000000000000000 R12: ffff880317d5a800
      2014-06-05T16:45:09.858977-04:00 c0-0c0s2n3 R13: ffff8803daf91880 R14: 00000000fffffff5 R15: 0000000000000001
      2014-06-05T16:45:09.888794-04:00 c0-0c0s2n3 FS: 00007f28c44457a0(0000) GS:ffff880407cc0000(0000) knlGS:0000000000000000
      2014-06-05T16:45:09.888807-04:00 c0-0c0s2n3 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      2014-06-05T16:45:09.888818-04:00 c0-0c0s2n3 CR2: 000000000063c800 CR3: 000000031f33f000 CR4: 00000000000007e0
      2014-06-05T16:45:09.888824-04:00 c0-0c0s2n3 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      2014-06-05T16:45:09.888834-04:00 c0-0c0s2n3 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      2014-06-05T16:45:09.918910-04:00 c0-0c0s2n3 Process kiblnd_sd_01_01 (pid: 4554, threadinfo ffff8803ea378000, task ffff8803e89480c0)
      2014-06-05T16:45:09.918924-04:00 c0-0c0s2n3 Stack:
      2014-06-05T16:45:09.918940-04:00 c0-0c0s2n3 ffff8803ea379bd0 ffff880317d5a800 0000000000000001 0000000000000001
      2014-06-05T16:45:09.918951-04:00 c0-0c0s2n3 00000000fffffff5 0000000000000001 ffff8803ea379c10 ffffffffa0338b28
      2014-06-05T16:45:09.918956-04:00 c0-0c0s2n3 ffff880317d5a918 dead000000200200 ffff880317d5a800 ffff8803e9b18d80
      2014-06-05T16:45:09.918961-04:00 c0-0c0s2n3 Call Trace:
      2014-06-05T16:45:09.918966-04:00 c0-0c0s2n3 [<ffffffffa0338b28>] lnet_msg_decommit+0xf8/0x6b0 [lnet]
      2014-06-05T16:45:09.948770-04:00 c0-0c0s2n3 [<ffffffffa0339b47>] lnet_finalize+0x297/0x7d0 [lnet]
      2014-06-05T16:45:09.948783-04:00 c0-0c0s2n3 [<ffffffffa03465ed>] lnet_parse+0xc2d/0x1b80 [lnet]
      2014-06-05T16:45:09.948794-04:00 c0-0c0s2n3 [<ffffffffa03db68a>] kiblnd_handle_rx+0x30a/0x690 [ko2iblnd]
      2014-06-05T16:45:09.948805-04:00 c0-0c0s2n3 [<ffffffffa03e03af>] kiblnd_rx_complete+0x34f/0x420 [ko2iblnd]
      2014-06-05T16:45:09.948815-04:00 c0-0c0s2n3 [<ffffffffa03e0d25>] kiblnd_scheduler+0x7c5/0x970 [ko2iblnd]
      2014-06-05T16:45:09.948821-04:00 c0-0c0s2n3 [<ffffffff810672fe>] kthread+0x9e/0xb0
      2014-06-05T16:45:09.978765-04:00 c0-0c0s2n3 [<ffffffff81481874>] kernel_thread_helper+0x4/0x10
      2014-06-05T16:45:09.978785-04:00 c0-0c0s2n3 Code: c2 0f 85 2b 01 00 00 8d 41 01 85 c0 41 89 45 48 0f 8f dc fe ff ff 49 8b 7d 20 be 01 00 00 00 48 83 ef 10 48 8b 47 18 48 8b 57 10
      2014-06-05T16:45:10.004304-04:00 c0-0c0s2n3 89 42 08 48 89 10 48 b8 00 01 10 00 00 00 ad de 48 89 47 10
      2014-06-05T16:45:10.004326-04:00 c0-0c0s2n3 RIP [<ffffffffa0341831>] lnet_return_rx_credits_locked+0x171/0x310 [lnet]
      2014-06-05T16:45:10.004333-04:00 c0-0c0s2n3 RSP <ffff8803ea379bb0>
      2014-06-05T16:45:10.029888-04:00 c0-0c0s2n3 --[ end trace 17126666cf42dece ]--
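      A note on decoding the registers above: RDX and RAX hold 0xdead000000100100 and 0xdead000000200200, the Linux LIST_POISON1/LIST_POISON2 values that list_del() writes into an unlinked entry's next/prev pointers. Seeing them dereferenced in lnet_return_rx_credits_locked suggests a list entry was deleted twice or walked after deletion. The following is a minimal user-space sketch of the kernel's poisoning scheme; the list_head type, list_del helper, and poison constants are reimplemented here for illustration, not taken from the kernel headers:

      ```c
      #include <assert.h>
      #include <stdio.h>

      /* Same poison constants the kernel uses on x86_64. */
      #define LIST_POISON1 ((void *)0xdead000000100100ULL)
      #define LIST_POISON2 ((void *)0xdead000000200200ULL)

      struct list_head { struct list_head *next, *prev; };

      static void list_del(struct list_head *entry)
      {
          /* Unlink, then poison: a second list_del() or a walk through
           * this entry dereferences the poison pointers and oopses
           * loudly instead of silently corrupting the list. */
          entry->next->prev = entry->prev;
          entry->prev->next = entry->next;
          entry->next = LIST_POISON1;
          entry->prev = LIST_POISON2;
      }

      int main(void)
      {
          struct list_head head = { &head, &head };
          struct list_head node;

          /* Insert node at the head of the list. */
          node.next = head.next;
          node.prev = &head;
          head.next->prev = &node;
          head.next = &node;

          list_del(&node);
          /* These are exactly the dead0000001xxxxx/2xxxxx values seen
           * in RDX/RAX in the oops above. */
          assert(node.next == LIST_POISON1);
          assert(node.prev == LIST_POISON2);
          printf("poisoned: next=%p prev=%p\n", node.next, node.prev);
          return 0;
      }
      ```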

      Attachments

        Activity

          [LU-5151] Oops in lnet_return_rx_credits_locked

          jlevi Jodi Levi (Inactive) added a comment - Patch landed to Master.

          liang Liang Zhen (Inactive) added a comment - You're welcome; it's assigned to me anyway. I think we can close it now, because the fix has landed on master, which is the only branch with this issue.

          simmonsja James A Simmons added a comment - Thank you Liang for working with me on this issue. IMHO the new LNET_CREDIT_* values make it far clearer what is going on than using EAGAIN or just returning zero.
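          To illustrate the design the thread converged on: distinct, named positive status codes keep "queued, try later" visibly separate from errno-style failures, so a `rc < 0` check cleanly selects the drop path. The sketch below is modeled on the LNET_CREDIT_* idea mentioned above; the constant values, the parse_forward stand-in, and the dispatch helper are all illustrative assumptions, not code from the landed patch:

          ```c
          #include <assert.h>

          /* Hypothetical status codes in the spirit of LNET_CREDIT_*:
           * non-negative statuses are disjoint from negative errnos. */
          #define LNET_CREDIT_OK   0  /* credit taken, proceed with receive */
          #define LNET_CREDIT_WAIT 1  /* queued, waiting for a router buffer */

          /* Stand-in for a forwarding decision like lnet_parse_forward_locked. */
          static int parse_forward(int have_buffer, int bad_dest)
          {
              if (bad_dest)
                  return -22;              /* -EINVAL: genuine failure, drop */
              if (!have_buffer)
                  return LNET_CREDIT_WAIT; /* not an error: resumed later */
              return LNET_CREDIT_OK;
          }

          static const char *dispatch(int rc)
          {
              if (rc < 0)
                  return "drop";           /* only real errors reach free_drop */
              if (rc == LNET_CREDIT_WAIT)
                  return "wait";
              return "recv";
          }

          int main(void)
          {
              assert(dispatch(parse_forward(1, 0))[0] == 'r'); /* recv */
              assert(dispatch(parse_forward(0, 0))[0] == 'w'); /* wait */
              assert(dispatch(parse_forward(1, 1))[0] == 'd'); /* drop */
              return 0;
          }
          ```

          With overloaded errnos, a "waiting for buffer" -EAGAIN and a real -EAGAIN from an LND are indistinguishable at the call site; named statuses remove that ambiguity.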
          spitzcor Cory Spitz added a comment - James, thanks for reporting this problem. We have seen this at Cray as well, of course, and only on routers, as you did. We have been working around the problem by using b2_5-vintage routers.
          liang Liang Zhen (Inactive) added a comment (edited) - Sorry, I didn't notice there were updates and review comments on the patch, and I overwrote it. After rethinking, I agree with Isaac and still tend to keep the positive values; that is much cleaner to me.
          simmonsja James A Simmons added a comment (edited) - I see there are strong opinions on this. I propose that if you want to keep returning positive values from some of these functions, we call them something other than EAGAIN and ENOENT. Those values will be flagged by the HPPD checker, and using them is frowned upon upstream. So I suggest you define your own errors: EAGAIN becomes LNET_RETRY and ENOENT becomes LNET_MISMATCH. Can you live with this compromise?

          simmonsja James A Simmons added a comment - I wouldn't consider that a huge limitation. The LASSERT in lnet_ni_eager_recv would handle this case today, so a potential driver writer would know not to return -EAGAIN.

          liang Liang Zhen (Inactive) added a comment - Hi James, both ways should be fine. My concern about -EAGAIN is that we would need to impose another condition on the LND: -EAGAIN must never be returned by lnet_ni_eager_recv. That is true for now, but it's probably better not to have that limit.

          simmonsja James A Simmons added a comment - I tried the above version of my patch and it resolved the issue. Liang, do you mind if we go with that version instead?

          simmonsja James A Simmons added a comment - I changed it from EAGAIN to -EAGAIN so it matches the behavior in the upstream kernel; using a positive EAGAIN in kernel space is frowned upon. How about instead we just do:

          diff --git a/lnet/lnet/lib-move.c b/lnet/lnet/lib-move.c
          index 6097ae0..5fcc19b 100644
          --- a/lnet/lnet/lib-move.c
          +++ b/lnet/lnet/lib-move.c
          @@ -1961,12 +1961,14 @@ lnet_parse(lnet_ni_t *ni, lnet_hdr_t *hdr, lnet_nid_t from_nid,
           		rc = lnet_parse_forward_locked(ni, msg);
           		lnet_net_unlock(cpt);

          -		if (rc < 0)
          +		if (rc == -EAGAIN)	/* waiting for buffer */
          +			return 0;
          +
          +		if (rc != 0)
           			goto free_drop;
          -		if (rc == 0) {
          -			lnet_ni_recv(ni, msg->msg_private, msg, 0,
          -				     0, payload_length, payload_length);
          -		}
          +
          +		lnet_ni_recv(ni, msg->msg_private, msg, 0,
          +			     0, payload_length, payload_length);
           		return 0;
           	}

          People

            liang Liang Zhen (Inactive)
            simmonsja James A Simmons
            Votes: 0
            Watchers: 8

            Dates

              Created:
              Updated:
              Resolved: