[LU-1938] Operational incompatibility - RHEL6.2, kernel-2.6.32-279.2.1.el6_x86_64, MLNX_OFED-1.5.3-3.1.0 Created: 14/Sep/12 Updated: 08/Mar/14 Resolved: 08/Mar/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Jeff Johnson (Inactive) | Assignee: | Doug Oucharek (Inactive) |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Environment: |
RHEL 6.2, kernel-2.6.32-279.2.1.el6_x86_64, RHEL kernel_ib on MDS/OSS nodes, Mellanox OFED 1.5.3-3.1.0 on client. |
||
| Severity: | 3 |
| Epic: | hang |
| Rank (Obsolete): | 10076 |
| Description |
|
Newly deployed filesystem. While running an initial test rsync from a client to the filesystem, the rsync process hung. We also see other problems: poor performance and network errors (example below).
Lustre: 16691:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1347583367/real 1347583367] req@ffff880664530c00 x1413043208599797/t0(0) o8->alicefs-OST0006-osc-ffff880832436000@172.20.2.125@o2ib:28/4 lens 368/512 e 0 to 1 dl 1347583422 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
INFO: task rsync:16826 blocked for more than 120 seconds. |
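When many such network errors appear, it can help to pull the affected target and peer address out of each line so they can be counted per server. The following is a minimal sketch, not a Lustre tool: the regular expression is inferred only from the single sample line in the description, and the names `REQ_RE` and `parse_request` are hypothetical.

```python
import re

# Sample line copied from the description above.
LOG_LINE = (
    "Lustre: 16691:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has "
    "failed due to network error: [sent 1347583367/real 1347583367] "
    "req@ffff880664530c00 x1413043208599797/t0(0) "
    "o8->alicefs-OST0006-osc-ffff880832436000@172.20.2.125@o2ib:28/4 "
    "lens 368/512 e 0 to 1 dl 1347583422 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1"
)

# Assumed layout, based on this one sample: o<opcode>-><target>@<nid>:<rest>
REQ_RE = re.compile(r"\bo(?P<opcode>\d+)->(?P<target>[^@\s]+)@(?P<nid>[^\s:]+):")


def parse_request(line):
    """Return opcode, target and NID from a request log line, or None if absent."""
    match = REQ_RE.search(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_request(LOG_LINE))
```

Lines that do not contain the `o<opcode>->` field simply yield `None`, so the function can be run over a whole syslog file and the results grouped by NID.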
| Comments |
| Comment by Jeff Johnson (Inactive) [ 14/Sep/12 ] |
|
Is it necessary to run the same IB stack on client and server nodes? The server side is currently running the RHEL 6.2 in-kernel IB stack. Clients are running MLNX_OFED-1.5.3-3.1.0. |
| Comment by Peter Jones [ 14/Sep/12 ] |
|
Doug will help with this one |
| Comment by Jeff Johnson (Inactive) [ 14/Sep/12 ] |
|
I have rebuilt the server-side RPMs against the same Mellanox OFED release used on the client side. I will have the filesystem up in a few minutes and will report the outcome. Kernel: 2.6.32-279.2.1.el6_lustre.gc46c389.x86_64 |
| Comment by Doug Oucharek (Inactive) [ 14/Sep/12 ] |
|
It looks like the rebuild addressed the IB issues (lnet_selftest works fine). Outstanding issues: rsync hangs after a few seconds and standard df hangs (but lfs df works). |
| Comment by Doug Oucharek (Inactive) [ 17/Sep/12 ] |
|
Jeff: did you get a chance to verify if rsync and dd are now working? |
| Comment by Jeff Johnson (Inactive) [ 18/Sep/12 ] |
|
Doug, I recompiled the Lustre 2.1.3 source against Mellanox OFED 1.5.3-3.1.0. With the server and client sides running the same IB stack, it appears there are no problems. I had seen this once before on RHEL 5.5 with Lustre 1.8.4, where the client/cluster side ran OFED, the Lustre server side ran RHEL's in-kernel IB, and the lnet communication was dysfunctional. As we know, standards aren't really that standard. I am going to test this on our in-house cluster when I get back to the US. I think there is a lingering incompatibility between RHEL's IB and clients running OFED. I will pass my findings along to WC. Since OFED is so popular on the cluster node side of the equation, perhaps WC might consider doing an OFED variant of the precompiled Lustre packages. Just a thought. |
| Comment by Isaac Huang (Inactive) [ 18/Sep/12 ] |
|
It could be hard to find out which official OFED version a RHEL kernel was based on. The kernel merge windows and OFED release cycles often don't match, and Red Hat would very often backport fixes from newer versions of OFED. I think it's a good idea for clients and servers to at least use IB stacks from the same source, e.g. OpenFabrics or Mellanox or Red Hat - then there are at least change logs to check for differences. |
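The suggestion to keep clients and servers on IB stacks from the same source can be checked mechanically once each node's stack has been recorded. The sketch below assumes the operator has already collected one free-form stack label per host by whatever means is available; the hostnames and the helper names `group_by_stack` and `is_consistent` are hypothetical, and the labels are simply compared as strings.

```python
from collections import defaultdict


def group_by_stack(stacks):
    """Group hostnames by the IB stack label recorded for each host."""
    groups = defaultdict(list)
    for host, stack in stacks.items():
        groups[stack.strip()].append(host)
    return {stack: sorted(hosts) for stack, hosts in groups.items()}


def is_consistent(stacks):
    """True when every host reports the same IB stack label."""
    return len(group_by_stack(stacks)) <= 1


if __name__ == "__main__":
    # Hypothetical hostnames; labels mirror the setup described in this ticket.
    reported = {
        "mds01": "RHEL 6.2 in-kernel IB",
        "oss01": "RHEL 6.2 in-kernel IB",
        "client01": "MLNX_OFED-1.5.3-3.1.0",
    }
    for stack, hosts in group_by_stack(reported).items():
        print(f"{stack}: {', '.join(hosts)}")
    print("consistent:", is_consistent(reported))
```

A string comparison only flags that the labels differ; it says nothing about whether two differing stacks actually interoperate, which is the harder question raised in this ticket.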
| Comment by John Fuchs-Chesney (Inactive) [ 08/Mar/14 ] |
|
Looks like this issue has been resolved and no further action is required. |