[LU-3617] configure incorrectly finds no for RDMA events 14 and 15 on latest RHEL5 Created: 22/Jul/13 Updated: 29/Oct/13 Resolved: 29/Oct/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 1.8.9 |
| Fix Version/s: | Lustre 1.8.9 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Kit Westneat (Inactive) | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | mn8, patch | ||
| Severity: | 3 |
| Rank (Obsolete): | 9298 |
| Description |
|
We ran into an issue at Yale recently where the Lustre servers all got RDMA_CM_EVENT_TIMEWAIT_EXIT, which the servers didn't recognize and LBUGed on. There are two problems here. The first is that configure clearly doesn't do the right thing anymore, and the second is that Lustre shouldn't LBUG if it gets those events, even if configure doesn't find them. I double checked on your build systems to make sure it wasn't just ours: That's the official 1.8.9 build console if I'm not mistaken, and you can see: |
| Comments |
| Comment by Kit Westneat (Inactive) [ 22/Jul/13 ] |
|
The patch we used: |
| Comment by Isaac Huang (Inactive) [ 22/Jul/13 ] |
|
This looks like |
| Comment by Kit Westneat (Inactive) [ 22/Jul/13 ] |
|
It's similar, but for RHEL in-kernel RDMA instead of Mellanox. I think this shows just how fragile the configure test is. Are these tests even necessary anymore? It looks like the events were added over 5 years ago. At the very least, it seems like there should be some kind of acceptance test added to make sure that Intel releases always have support for these events. The logic for using LBUG still seems strange: because it might not function properly, you make sure that it doesn't function properly? It seems like an error message or something similar would be just as effective. Events aren't added to OFED very often, right? |
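To make the failure mode discussed above concrete, here is a minimal user-space sketch of how a configure result can gate event handling. It is not the o2iblnd code: every `SKETCH_` and `sketch_` name is hypothetical, the two macros stand in for whatever configure writes to `config.h`, and the event numbers 14 and 15 are taken from the ticket title (9 is only an example of an event that is always handled). Setting the macros to 0 models configure answering "no".

```c
#include <stdio.h>

/* Local stand-ins for RDMA CM event numbers. 14 and 15 come from the
 * ticket title; 9 is just an example of an always-handled event. */
enum sketch_cm_event {
    SKETCH_EVENT_ESTABLISHED   = 9,
    SKETCH_EVENT_ADDR_CHANGE   = 14,
    SKETCH_EVENT_TIMEWAIT_EXIT = 15
};

/* Hypothetical macros standing in for the configure output.
 * 0 models the reported problem: the probe answers "no". */
#ifndef SKETCH_HAVE_ADDR_CHANGE
#define SKETCH_HAVE_ADDR_CHANGE 0
#endif
#ifndef SKETCH_HAVE_TIMEWAIT_EXIT
#define SKETCH_HAVE_TIMEWAIT_EXIT 0
#endif

enum sketch_result { SKETCH_HANDLED, SKETCH_FATAL };

/* Dispatch one event. Cases for events 14 and 15 are only compiled in
 * when the matching macro is non-zero, so a wrong "no" from configure
 * sends a perfectly valid event down the default (fatal) path. */
static enum sketch_result sketch_cm_callback(int event)
{
    switch (event) {
    case SKETCH_EVENT_ESTABLISHED:
        return SKETCH_HANDLED;
#if SKETCH_HAVE_ADDR_CHANGE
    case SKETCH_EVENT_ADDR_CHANGE:
        return SKETCH_HANDLED;
#endif
#if SKETCH_HAVE_TIMEWAIT_EXIT
    case SKETCH_EVENT_TIMEWAIT_EXIT:
        return SKETCH_HANDLED;
#endif
    default:
        /* Stand-in for the LBUG path. */
        fprintf(stderr, "unexpected CM event %d\n", event);
        return SKETCH_FATAL;
    }
}
```

Building the same file with `-DSKETCH_HAVE_TIMEWAIT_EXIT=1` makes event 15 take the handled path, which is why a broken probe is enough to turn a supported event into a fatal one.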
| Comment by Peter Jones [ 23/Jul/13 ] |
|
Amir, what is your view of this? Peter |
| Comment by Isaac Huang (Inactive) [ 23/Jul/13 ] |
|
Yes, it might not work so we make sure it would not work. The consequences of ignoring unknown events are unknown, which might include data loss or corruption - we don't know, so better safe than sorry. Unknown events are supposed to happen ONLY when the o2iblnd is used with an unsupported OFED release. In this case, the OFED/events are supported; it's the tests that were broken. I think the LBUG is fine - we should fix the tests, and make sure that such test failures won't happen again. |
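The two positions in this exchange can be put side by side in a small sketch. This is an illustration only, under the assumption that the handler can report an error to its caller; the `sketch_` names and the choice of `-EINVAL` are hypothetical, and `abort()` merely stands in for LBUG in user space.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Two ways to react to an event the handler does not recognize. */
enum sketch_policy {
    SKETCH_POLICY_ASSERT, /* stop immediately, in the spirit of LBUG */
    SKETCH_POLICY_LOG     /* log the event and report an error */
};

/* Handle an unrecognized event according to the chosen policy.
 * Returns a negative errno value when it returns at all. */
static int sketch_unknown_event(int event, enum sketch_policy policy)
{
    fprintf(stderr, "unrecognized CM event %d\n", event);

    if (policy == SKETCH_POLICY_ASSERT)
        abort(); /* never returns: the node stops here */

    /* The caller decides what to do with the connection. */
    return -EINVAL;
}
```

The sketch says nothing about which policy is safe; with the logging policy, whatever happens after the error is returned is exactly the unknown consequence raised above.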
| Comment by Andreas Dilger [ 29/Jul/13 ] |
|
It looks like this is RHEL 5. What version of RHEL, and what version of OFED? |
| Comment by Kit Westneat (Inactive) [ 29/Jul/13 ] |
|
That is correct. It is RHEL 5.9, with the in-kernel RDMA. |
| Comment by James A Simmons [ 28/Aug/13 ] |
|
We just ran into this problem on our production systems. Not detecting certain features can cause an Oops. I have a patch that fixes this problem at http://review.whamcloud.com/#/c/7488. Sorry, the patch points to the original ticket I filed. |
| Comment by James A Simmons [ 29/Aug/13 ] |
|
I'm seeing Lustre initialization errors in Maloo. The logs are not very informative so does anyone know what is going wrong? |
| Comment by Isaac Huang (Inactive) [ 25/Sep/13 ] |
|
James, can you please look at the questions I just posted on Gerrit? |
| Comment by James A Simmons [ 26/Sep/13 ] |
|
Both tests fail for the same reason. In file included from /data/buildsystem/jsimmons-widow/rpmbuild/BUILD/kernel-2.6.18/linux-2.6.18-348.3.1.el5.x86_64/ ------------------------------------------------------------------------------------------------------------- So yes, vlan_dev_vlan_id is located in fc_compact.h |
| Comment by Amir Shehata (Inactive) [ 29/Oct/13 ] |
|
This is a duplicate of 3166, which tracks the same issue for release 2.5 |
| Comment by Kit Westneat (Inactive) [ 29/Oct/13 ] |
|
While the symptoms are the same, this is not the same issue. |