[LU-9461] lustre client mount fail after update IB driver and Lustre patch. Created: 08/May/17  Updated: 18/Sep/17  Resolved: 30/Aug/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: sebg-crd-pm (Inactive) Assignee: Amir Shehata (Inactive)
Resolution: Duplicate Votes: 1
Labels: llnl
Environment:

CentOS7.3
Lustre 2.9.0 + cherry-picked as e4297ef38561f1e788ba73ca0c8078a09dc8c303
MLNX_OFED_LINUX-4.0-2.0.0.1-rhel7.3
IB: Mellanox ConnectX-4 adapter EDR


Issue Links:
Related
is related to LU-9500 MOFED 4/mlx5: Aligning non-aligned pa... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The Lustre client mount fails after updating the IB driver and applying the Lustre client patch (LU-9026).
Should I apply any other patches for the new IB driver?

Mount failure error messages:
[ 5713.280039] LNet: Using FastReg for registration
[ 5713.370689] LNet: Added LNI 192.168.2.220@o2ib0 [8/256/0/180]
[ 5736.543149] LNetError: 0:0:(o2iblnd_cb.c:3436:kiblnd_qp_event()) 192.168.2.201@o2ib0: Async QP event type 3
[ 5743.539710] Lustre: 15524:0:(client.c:2111:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1493897510/real 1493897510] req@ffff881011ef0300 x1566465051328608/t0(0) o503->MGC192.168.2.201@o2ib0@192.168.2.201@o2ib0:26/25 lens 272/8416 e 0 to 1 dl 1493897517 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
[ 5743.539728] LustreError: 166-1: MGC192.168.2.201@o2ib0: Connection to MGS (at 192.168.2.201@o2ib0) was lost; in progress operations using this service will fail
[ 5743.539899] LustreError: 15c-8: MGC192.168.2.201@o2ib0: The configuration from log 'hpcfs-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
[ 5743.540282] Lustre: Unmounted hpcfs-client
[ 5743.544658] LustreError: 15524:0:(obd_mount.c:1449:lustre_fill_super()) Unable to mount (-5)



 Comments   
Comment by Peter Jones [ 08/May/17 ]

Hi there

The LU-9026 patch will allow building with MOFED 4.0 but there are still known issues attempting to run.

Doug

Do you have any suggestions here?

Peter

Comment by Doug Oucharek (Inactive) [ 09/May/17 ]

The "Async QP event type 3" is an IB_EVENT_QP_ACCESS_ERR.  This error will stop the connection from continuing and explains all the errors which follow in your logs.
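
For readers decoding similar console lines, here is a minimal userspace sketch of the mapping from the numeric event type to its name. The names follow the first entries of the kernel's enum ib_event_type in include/rdma/ib_verbs.h; the helper functions qp_event_name() and parse_qp_event_type() are hypothetical and exist only in this illustration, not in Lustre or MOFED.

```c
#include <stdio.h>
#include <string.h>

/*
 * Userspace illustration only. The names mirror the first entries of the
 * kernel's enum ib_event_type (include/rdma/ib_verbs.h); later values are
 * deliberately not covered by this sketch.
 */
const char *qp_event_name(int type)
{
	switch (type) {
	case 0: return "IB_EVENT_CQ_ERR";
	case 1: return "IB_EVENT_QP_FATAL";
	case 2: return "IB_EVENT_QP_REQ_ERR";
	case 3: return "IB_EVENT_QP_ACCESS_ERR";
	case 4: return "IB_EVENT_COMM_EST";
	default: return "(not covered by this sketch)";
	}
}

/*
 * Hypothetical helper: extract the number that follows
 * "Async QP event type " in an o2iblnd console line.
 * Returns -1 if the marker or the number is missing.
 */
int parse_qp_event_type(const char *line)
{
	static const char marker[] = "Async QP event type ";
	const char *p = strstr(line, marker);
	int type;

	if (p == NULL || sscanf(p + strlen(marker), "%d", &type) != 1)
		return -1;
	return type;
}
```

Passing the LNetError line from the description through parse_qp_event_type() and then qp_event_name() gives the name quoted above.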

This error happens when a call to ib_create_qp() fails.  It could fail if the version of the parameters being passed in is wrong (i.e. the size of the parameter structure is incorrect).  This could be related to the other issue I am currently working on that involves MOFED 4.  If MOFED 4 has changed the structure we use as a parameter to this call and we have not adapted to that change, we could see an error like this.

Does this error happen on each mount attempt or was this a one off?

Comment by sebg-crd-pm (Inactive) [ 10/May/17 ]

This error happens on every mount attempt. (The test Lustre filesystem servers run OFED 3.4.)
"Async QP event type 3" also occurs when the Lustre servers mount mgs/mds/... with OFED4.

Comment by Doug Oucharek (Inactive) [ 16/May/17 ]

I ran into this "Async" error at the same time as the issues I talk about in LU-9500.  They are related.  When I have a fix for LU-9500, this issue will be addressed as well.

Comment by Li Dongyang (Inactive) [ 17/May/17 ]

I think it's my fault.
Could you try this patch on top of LU-9472 and LU-9500?

diff --git a/lnet/klnds/o2iblnd/o2iblnd.c b/lnet/klnds/o2iblnd/o2iblnd.c
index 047fe3c..ba7829b 100644
--- a/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/lnet/klnds/o2iblnd/o2iblnd.c
@@ -1900,8 +1900,6 @@ again:
                                        return n < 0 ? n : -EINVAL;
                                }
 
-                               mr->iova = iov;
-
                                wr = &frd->frd_fastreg_wr;
                                memset(wr, 0, sizeof(*wr));


Comment by Doug Oucharek (Inactive) [ 17/May/17 ]

I'm curious as to why you would not want to set the mr->iova value?  Is this an unneeded step?

Comment by Li Dongyang (Inactive) [ 18/May/17 ]

Hi Doug, 

I believe the issue only applies to mlx5 cards using MOFED4.

In MOFED4, mr->iova is set by ib_map_mr_sg()->ib_sg_to_pages().

It doesn't make sense to reset mr->iova after calling ib_map_mr_sg().

That line of code was introduced to address a similar issue; see the comments on

https://review.whamcloud.com/#/c/19168/

I've done some testing using MOFED4 + lustre-release with mlx4 cards forcing fast reg as well, so far I've seen no problems.
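
To illustrate the point about mr->iova, here is a toy userspace model, not kernel code. The toy_* types and functions are hypothetical stand-ins: toy_map_mr_sg() imitates only the one behaviour described above (the mapping call recording the region's start address in mr->iova), and the overriding value in the usage note is an assumption chosen for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-ins for a scatterlist entry and a memory region. */
struct toy_sg { uint64_t dma_addr; uint32_t len; };
struct toy_mr { uint64_t iova; uint64_t length; };

/*
 * Imitates the relevant side effect of the mapping call: the region's
 * start address and total length are derived from the scatterlist
 * itself. Returns the number of entries mapped.
 */
int toy_map_mr_sg(struct toy_mr *mr, const struct toy_sg *sg,
		  int nents, uint32_t sg_offset)
{
	int i;

	assert(nents > 0 && sg_offset < sg[0].len);
	mr->iova = sg[0].dma_addr + sg_offset;
	mr->length = 0;
	for (i = 0; i < nents; i++)
		mr->length += sg[i].len;
	mr->length -= sg_offset;
	return nents;
}

/* Returns 1 if [addr, addr + len) lies inside the region's window. */
int toy_mr_covers(const struct toy_mr *mr, uint64_t addr, uint64_t len)
{
	return addr >= mr->iova && len <= mr->length &&
	       addr - mr->iova <= mr->length - len;
}
```

In this model, a region mapped from a scatterlist starting at 0x10000200 covers an access at that address. If the caller then overwrites mr->iova with some other value (say 0), the same access falls outside the window, which is the kind of mismatch that could plausibly surface as an access error on the QP.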

Comment by sebg-crd-pm (Inactive) [ 18/May/17 ]

I can mount Lustre OK with 2.9.57_69_g0bc1964 + LU-9472 / LU-9500 and this patch (removing mr->iova = iov).

Is it OK to apply only LU-9026 + LU-9472 / LU-9500 and the mr->iova = iov removal to 2.9.0?
Should I apply any other patches to Lustre 2.9.0 for mlx5 cards using MOFED4? Thanks for your suggestions.

Comment by Doug Oucharek (Inactive) [ 18/May/17 ]

Li: Thank you for the information.  I checked and you are right, ib_map_mr_sg() does set mr->iova so that line is not needed (could cause a problem).  I will update LU-9500 with a removal of that line.

sebg-crd-pm: Correct, you only need LU-9026 + LU-9500 (with the removal of setting mr->iova) + LU-9472.

Comment by Giuseppe Di Natale (Inactive) [ 29/Aug/17 ]

We have just encountered this issue as well. Is it possible to have LU-9026, LU-9500, and LU-9472 backported to the lustre 2.5 and 2.8 branches?

Comment by Peter Jones [ 30/Aug/17 ]

dinatale2 can you please open a new ticket to track this request?

Comment by sebg-crd-pm (Inactive) [ 08/Sep/17 ]

Hi,

I got an error creating a striped directory in Lustre 2.10 with the LU-9500 patch (https://review.whamcloud.com/#/c/28237/) for OFED 4.0:

[root@hsm client]# lfs mkdir -c 2 dir1
error on LL_IOC_LMV_SETSTRIPE 'dir1' (3): Input/output error
error: mkdir: create stripe dir 'dir1' failed

Should I apply any other patch for this issue? Thanks

Comment by Peter Jones [ 08/Sep/17 ]

This issue is being tracked under LU-9958

Generated at Sat Feb 10 02:26:24 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.