[LU-5151] Oops in lnet_return_rx_credits_locked Created: 05/Jun/14 Updated: 11/Jun/14 Resolved: 11/Jun/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0 |
| Fix Version/s: | Lustre 2.6.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | James A Simmons | Assignee: | Liang Zhen (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | lnet | ||
| Environment: |
Cray router to connect infiniband to gemini interconnect. |
||
| Severity: | 3 |
| Epic: | lnet |
| Rank (Obsolete): | 14214 |
| Description |
|
While testing 2.6 in my Cray test environment I keep losing my routers. An NMI produces the following backtrace:
2014-06-05T16:45:09.828951-04:00 c0-0c0s2n3 Pid: 4554, comm: kiblnd_sd_01_01 Tainted: P N 3.0.82-0.7.9_1.0502.7780-cray_gem_s #1 |
| Comments |
| Comment by James Nunez (Inactive) [ 05/Jun/14 ] |
|
Liang, Would you please comment on this ticket? Thank you, |
| Comment by Liang Zhen (Inactive) [ 06/Jun/14 ] |
|
Hi James, to narrow down the problem, have you ever seen this issue with other versions between 2.4 and 2.6? thanks |
| Comment by Liang Zhen (Inactive) [ 06/Jun/14 ] |
|
I see the reason here. I think it's because this patch (http://review.whamcloud.com/#/c/9369/) changed the return value of lnet_post_routed_recv_locked() from positive to negative:
if (!for_me) {
        rc = lnet_parse_forward_locked(ni, msg);
        lnet_net_unlock(cpt);
        if (rc < 0)
                goto free_drop;
        if (rc == 0) {
                lnet_ni_recv(ni, msg->msg_private, msg, 0,
                             0, payload_length, payload_length);
        }
        return 0;
}
|
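To make the sign problem concrete, here is a minimal sketch of the three-way branch quoted above. The enum and the function handle_routed() are hypothetical stand-ins, not LNet code; only the tests on rc follow the snippet. It assumes the "wait for a router buffer" result used to arrive as positive EAGAIN and arrives as -EAGAIN after the patch.

```c
#include <errno.h>

/* Hypothetical outcomes for a routed message; not LNet types. */
enum outcome { FORWARD_NOW, QUEUED_FOR_BUFFER, DROPPED };

/*
 * Same shape as the quoted branch:
 *   rc <  0  -> goto free_drop
 *   rc == 0  -> receive immediately
 *   rc >  0  -> return 0 and leave the message queued
 */
static enum outcome handle_routed(int rc)
{
        if (rc < 0)
                return DROPPED;
        if (rc == 0)
                return FORWARD_NOW;
        return QUEUED_FOR_BUFFER;
}
```

With this shape, handle_routed(EAGAIN) leaves the message queued, while handle_routed(-EAGAIN) takes the drop path even though the callee only meant "wait".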
| Comment by James A Simmons [ 06/Jun/14 ] |
|
I have only seen the problem with 2.6. |
| Comment by Liang Zhen (Inactive) [ 06/Jun/14 ] |
|
patch is here: http://review.whamcloud.com/#/c/10625/ |
| Comment by James A Simmons [ 06/Jun/14 ] |
|
I changed it from EAGAIN to -EAGAIN so that it matches the behavior in the upstream kernel. Returning a positive EAGAIN is frowned upon in kernel space. How about we instead just do:
diff --git a/lnet/lnet/lib-move.c b/lnet/lnet/lib-move.c
|
| Comment by James A Simmons [ 09/Jun/14 ] |
|
I tried the above version of my patch and it resolved the issue. Liang, do you mind if we go with that version instead? |
| Comment by Liang Zhen (Inactive) [ 09/Jun/14 ] |
|
Hi James, both ways should be fine. My concern about -EAGAIN is that we need to impose another condition on the LND: -EAGAIN should never be returned by lnet_ni_eager_recv. |
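A rough sketch of why that extra condition would be needed, assuming the caller treats -EAGAIN as the "wait" signal. All names below are hypothetical and only illustrate the collision; they are not the LNet or LND interfaces.

```c
#include <errno.h>

/* Hypothetical verdicts a caller could reach from a return code. */
enum verdict { VERDICT_OK, VERDICT_WAIT, VERDICT_ERROR };

/* Assumption: -EAGAIN doubles as the "no buffer yet, wait" signal. */
static enum verdict classify(int rc)
{
        if (rc == 0)
                return VERDICT_OK;
        if (rc == -EAGAIN)
                return VERDICT_WAIT;
        return VERDICT_ERROR;
}

/* Hypothetical eager-receive callback that genuinely fails with -EAGAIN. */
static int fake_eager_recv_failing(void)
{
        return -EAGAIN;
}
```

Here classify(fake_eager_recv_failing()) yields VERDICT_WAIT, so a real failure would be mistaken for "wait" unless the callback is forbidden from returning -EAGAIN.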
| Comment by James A Simmons [ 09/Jun/14 ] |
|
I wouldn't consider that a huge limitation. The LASSERT in lnet_ni_eager_recv would handle this case today. So a potential driver writer would know not to return a -EAGAIN. |
| Comment by James A Simmons [ 09/Jun/14 ] |
|
I see there are strong opinions on this. I propose that if you want to keep using positive values in some of the functions, we call them something other than EAGAIN and ENOENT. Those will be flagged by the HPPD checker, and upstream will frown on using those values. So I suggest you define your own codes, so that EAGAIN becomes LNET_RETRY and ENOENT becomes LNET_MISMATCH. Can you live with this compromise? |
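As a sketch of that compromise: the names LNET_RETRY and LNET_MISMATCH come from the comment above, but the numeric values and the helper lnet_rc_is_error() are assumptions made only for illustration.

```c
#include <errno.h>

/* Names from the proposal; the values are hypothetical. */
#define LNET_RETRY      1       /* replaces positive EAGAIN */
#define LNET_MISMATCH   2       /* replaces positive ENOENT */

/*
 * Hypothetical helper: negative values stay real errno failures,
 * zero and the positive LNET_* codes are non-error status results.
 */
static int lnet_rc_is_error(int rc)
{
        return rc < 0;
}
```

A caller can then test rc < 0 for failures and compare against the named codes for everything else, without a positive errno constant appearing in kernel code.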
| Comment by Liang Zhen (Inactive) [ 10/Jun/14 ] |
|
Sorry, I didn't notice there were updates and review comments on the patch, and I overwrote it. |
| Comment by Cory Spitz [ 10/Jun/14 ] |
|
James, thanks for reporting this problem. We have seen this at Cray as well, of course, and only on routers as you had. We have been working around the problem by using b2_5 vintage routers. |
| Comment by James A Simmons [ 10/Jun/14 ] |
|
Thank you Liang for working with me on this issue. IMHO the new LNET_CREDIT_* values make it far more clear what is going on than using EAGAIN or just returning zero. |
| Comment by Liang Zhen (Inactive) [ 11/Jun/14 ] |
|
welcome, it's assigned to me anyway |
| Comment by Jodi Levi (Inactive) [ 11/Jun/14 ] |
|
Patch landed to Master. |