[LU-3617] configure incorrectly finds no for RDMA events 14 and 15 on latest RHEL5 Created: 22/Jul/13  Updated: 29/Oct/13  Resolved: 29/Oct/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.9
Fix Version/s: Lustre 1.8.9

Type: Bug Priority: Minor
Reporter: Kit Westneat (Inactive) Assignee: Amir Shehata (Inactive)
Resolution: Duplicate Votes: 0
Labels: mn8, patch

Severity: 3
Rank (Obsolete): 9298

 Description   

We ran into an issue at Yale recently where the Lustre servers all got RDMA_CM_EVENT_TIMEWAIT_EXIT, which the servers didn't recognize and LBUGed on. There are two problems here. The first is that configure clearly doesn't do the right thing anymore, and the second is that Lustre shouldn't LBUG if it gets those events, even if configure doesn't find them.

I double checked on your build systems to make sure it wasn't just ours:
http://build.whamcloud.com/job/lustre-b1_8/arch=x86_64,build_type=server,distro=el5,ib_stack=inkernel/258/consoleFull

That's the official 1.8.9 build console if I'm not mistaken, and you can see:
checking if OFED has RDMA_CM_EVENT_ADDR_CHANGE... no
checking if OFED has RDMA_CM_EVENT_TIMEWAIT_EXIT... no



 Comments   
Comment by Kit Westneat (Inactive) [ 22/Jul/13 ]

The patch we used:
http://review.whamcloud.com/#/c/7079/

Comment by Isaac Huang (Inactive) [ 22/Jul/13 ]

This looks like LU-3166. LBUG was used on unknown events because it's a serious situation - o2iblnd might not function properly if some unknown events are ignored.

Comment by Kit Westneat (Inactive) [ 22/Jul/13 ]

It's similar, but for RHEL in-kernel rdma instead of Mellanox. I think this shows just how fragile the configure test is. Are these tests even necessary anymore? It looks like the events were added over 5 years ago. At the very least, it seems like there should be some kind of acceptance test added to make sure that Intel releases always have support for these events.
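
For context on why a configure-time compile test exists at all: the RDMA CM event names are enumerators, not macros, so a plain `#ifdef` on the event name cannot detect them. The following is a minimal sketch of that point, assuming a stand-in enum (`demo_cm_event` and both helper functions are hypothetical names, not part of the kernel or Lustre):

```c
#include <stdio.h>

/* Stand-in for the kernel's enum rdma_cm_event_type (hypothetical name).
 * Only the tail of the enum is shown; 14 and 15 are the two events
 * named in this ticket's title. */
enum demo_cm_event {
	DEMO_CM_EVENT_MULTICAST_ERROR = 13,
	DEMO_CM_EVENT_ADDR_CHANGE     = 14,
	DEMO_CM_EVENT_TIMEWAIT_EXIT   = 15
};

/* Returns 1 if the preprocessor can see the name, 0 otherwise. */
int preprocessor_sees_timewait_exit(void)
{
#ifdef DEMO_CM_EVENT_TIMEWAIT_EXIT
	return 1;
#else
	return 0;	/* this branch is taken: enumerators are not macros */
#endif
}

/* Only compiles if the enumerator exists, which is why a configure
 * check has to try compiling a small program that uses the name. */
int compiler_sees_timewait_exit(void)
{
	return DEMO_CM_EVENT_TIMEWAIT_EXIT == 15;
}
```

The flip side is that such a compile test reports "no" for any compile failure, including one unrelated to the enumerator itself.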

The logic for using LBUG still seems strange - because it might not function properly, you make sure that it doesn't function properly? It seems like an error message or something would be as effective. Events aren't added to OFED very often, right?

Comment by Peter Jones [ 23/Jul/13 ]

Amir

What is your view of this?

Peter

Comment by Isaac Huang (Inactive) [ 23/Jul/13 ]

Yes, it might not work, so we make sure it does not work. The consequences of ignoring unknown events are unknown and might include data loss or corruption, so better safe than sorry. Unknown events are supposed to happen ONLY when the o2iblnd is used with an unsupported OFED release. In this case the OFED release and its events are supported; it's the tests that were broken. I think the LBUG is fine. We should fix the tests and make sure that such test failures won't happen again.
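
The two positions in this thread differ only in what the fall-through branch of the event handler does. Below is a minimal sketch of that structure, not the actual o2iblnd code: the enum, the `DEMO_HAVE_TIMEWAIT_EXIT` switch (standing in for a macro that configure would define), and `demo_cm_callback` are all hypothetical names. The default branch here logs and returns an error, which is the alternative proposed earlier in the thread; the LBUG approach would assert at the same spot instead.

```c
#include <errno.h>
#include <stdio.h>

/* Stand-in for a subset of the kernel's RDMA CM event enum. */
enum demo_cm_event {
	DEMO_CM_EVENT_ESTABLISHED   = 9,
	DEMO_CM_EVENT_DISCONNECTED  = 10,
	DEMO_CM_EVENT_ADDR_CHANGE   = 14,
	DEMO_CM_EVENT_TIMEWAIT_EXIT = 15
};

/* Hypothetical configure result. If the configure test wrongly reports
 * "no", this macro is absent and the guarded case below disappears. */
#define DEMO_HAVE_TIMEWAIT_EXIT 1

/* Returns 0 for events the handler knows, -EINVAL for anything else. */
int demo_cm_callback(int event)
{
	switch (event) {
	case DEMO_CM_EVENT_ESTABLISHED:
	case DEMO_CM_EVENT_DISCONNECTED:
		return 0;
#ifdef DEMO_HAVE_TIMEWAIT_EXIT
	case DEMO_CM_EVENT_TIMEWAIT_EXIT:
		return 0;	/* known event, nothing to do in this sketch */
#endif
	default:
		/* Unknown event: report it instead of asserting. */
		fprintf(stderr, "unexpected CM event %d\n", event);
		return -EINVAL;
	}
}
```

With the macro removed, event 15 falls through to the default branch even though the running kernel delivers it legitimately, which matches the failure described in this ticket.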

Comment by Andreas Dilger [ 29/Jul/13 ]

It looks like this is RHEL 5. What version of RHEL, and what version of OFED?

Comment by Kit Westneat (Inactive) [ 29/Jul/13 ]

That is correct. It is RHEL 5.9, with the in-kernel RDMA.

Comment by James A Simmons [ 28/Aug/13 ]

We just ran into this problem on our production systems. Not detecting certain features can cause an Oops. I have a patch that fixes this problem at

http://review.whamcloud.com/#/c/7488

Sorry the patch points to the original ticket I filed.

Comment by James A Simmons [ 29/Aug/13 ]

I'm seeing Lustre initialization errors in Maloo. The logs are not very informative so does anyone know what is going wrong?

Comment by Isaac Huang (Inactive) [ 25/Sep/13 ]

James, can you please look at the questions I just posted on Gerrit?

Comment by James A Simmons [ 26/Sep/13 ]

Both tests fail for the same reason.

In file included from /data/buildsystem/jsimmons-widow/rpmbuild/BUILD/kernel-2.6.18/linux-2.6.18-348.3.1.el5.x86_64/include/rdma/rdma_cm.h:39,
from /data/buildsystem/jsimmons-widow/rpmbuild/usr/src/lustre-1.8.9/build/conftest.c:43:
/data/buildsystem/jsimmons-widow/rpmbuild/BUILD/kernel-2.6.18/linux-2.6.18-348.3.1.el5.x86_64/include/rdma/ib_addr.h: In function 'rdma_vlan_dev_vlan_id':
/data/buildsystem/jsimmons-widow/rpmbuild/BUILD/kernel-2.6.18/linux-2.6.18-348.3.1.el5.x86_64/include/rdma/ib_addr.h:154: error: implicit declaration of function 'vlan_dev_vlan_id'

-------------------------------------------------------------------------------------------------------------
[jsimmons@testbox linux-2.6.18-348.3.1.el5.widow.x86_64]# grep -rl vlan_dev_vlan_id .
./include/rdma/ib_addr.h
./include/scsi/fc_compat.h

So yes, vlan_dev_vlan_id is located in fc_compat.h.

Comment by Amir Shehata (Inactive) [ 29/Oct/13 ]

This is a duplicate of LU-3166, which tracks the same issue for release 2.5.

Comment by Kit Westneat (Inactive) [ 29/Oct/13 ]

While the symptoms are the same, this is not the same issue. LU-3166 covers configuration issues with OFED-3.5; this one is for RHEL's in-kernel OFED. I think it should be reopened, as the solution is going to be different.

Generated at Sat Feb 10 01:35:27 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.