[LU-10252] backport LU-5718 change 12451/12 to b2_8_fe Created: 16/Nov/17  Updated: 20/Dec/17  Resolved: 07/Dec/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Olaf Faaland Assignee: Sonia Sharma (Inactive)
Resolution: Fixed Votes: 0
Labels: llnl
Environment:

fs/lustre-release-fe


Issue Links:
Related
is related to LU-5718: RDMA too fragmented with router (Resolved)
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We are encountering the issue described in LU-5718 on clients running Lustre 2.8 with Omni-Path fabrics. We would like the patch backported to b2_8_fe.

Console log messages

2017-11-12 10:20:23 [763673.420307] LNetError: 6383:0:(o2iblnd_cb.c:1105:kiblnd_init_rdma()) RDMA has too many fragments for peer 192.168.134.10@o2ib27 (256), src idx/frags: 128/256 dst idx/frags: 128/256
2017-11-12 10:20:23 [763673.438365] LNetError: 6383:0:(o2iblnd_cb.c:434:kiblnd_handle_rx()) Can't setup rdma for PUT to 192.168.134.10@o2ib27: -90
2017-11-12 10:23:00 [763830.403553] Lustre: 8245:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1510510823/real 1510510823] req@ffff88102df6e300 x1583085670240648/t0(0) o4->lsh-OST0009-osc-ffff881035fb1000@172.19.3.26@o2ib600:6/4 lens 608/448 e 2 to 1 dl 1510510980 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
2017-11-12 10:23:00 [763830.435966] Lustre: lsh-OST0009-osc-ffff881035fb1000: Connection to lsh-OST0009 (at 172.19.3.26@o2ib600) was lost; in progress operations using this service will wait for recovery to complete
2017-11-12 10:23:00 [763830.455086] Lustre: Skipped 42 previous similar messages
2017-11-12 10:23:00 [763830.488005] Lustre: lsh-OST0009-osc-ffff881035fb1000: Connection restored to 172.19.3.26@o2ib600 (at 172.19.3.26@o2ib600)
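
For reference (not from this ticket): the "RDMA has too many fragments for peer" error arises when the two sides of an o2iblnd connection disagree on the supported fragment count, which the LU-5718 patch addresses by negotiating it at connect time. On o2iblnd the fragment behavior is also influenced by the ko2iblnd map_on_demand module parameter; some sites have adjusted it as an interim mitigation while awaiting the fix. A hedged, illustrative config sketch only -- the value shown is an assumption, not a recommendation from this ticket, and any change must be applied consistently on clients, routers, and servers:

```
# /etc/modprobe.d/ko2iblnd.conf (illustrative; value is site-specific)
# map_on_demand caps the number of on-demand RDMA mappings per connection.
options ko2iblnd map_on_demand=32
```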


 Comments   
Comment by Olaf Faaland [ 16/Nov/17 ]

We see this in Lustre: Build Version: 2.8.0_8.chaos.

One of our larger clusters (~2600 clients) was moved to Lustre 2.8 recently. We expect to see many more instances of this, based on the frequency observed on a small (~128-client) test cluster.

The first question is whether the patch mentioned, https://review.whamcloud.com/#/c/12451/12, is the complete fix for this, or whether other patches are required.

The second question is whether this is a low-risk backport (it seems to me that it is, but you would know better).

Thanks.

Comment by Olaf Faaland [ 16/Nov/17 ]

I can't figure out how to link this ticket to LU-5718, so I am noting the relationship here.

Comment by Brad Hoagland (Inactive) [ 17/Nov/17 ]

Hi Amir,

Can you help with this one?

Thanks,

Brad

Comment by Sonia Sharma (Inactive) [ 20/Nov/17 ]

Here is the link to the patch for LU-5718 backported to b2_8_fe. It still needs to be tested and reviewed.

https://review.whamcloud.com/#/c/30184/

Please note it needs to be accompanied by the fixes from LU-9420 and LU-9425. Backported patches for these:
https://review.whamcloud.com/#/c/30185/

https://review.whamcloud.com/#/c/30284/

Comment by Joseph Gmitter (Inactive) [ 07/Dec/17 ]

The patch is backported and appears in Sonia's comment above. Is anything else needed here?

Generated at Sat Feb 10 02:33:24 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.