[LU-10252] backport LU-5718 change 12451/12 to b2_8_fe Created: 16/Nov/17 Updated: 20/Dec/17 Resolved: 07/Dec/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Olaf Faaland | Assignee: | Sonia Sharma (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl |
| Environment: | fs/lustre-release-fe |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We are encountering the issue described in LU-5718 on clients running Lustre 2.8 with Omni-Path fabrics. We would like the patch backported to b2_8_fe.

Console log messages:

2017-11-12 10:20:23 [763673.420307] LNetError: 6383:0:(o2iblnd_cb.c:1105:kiblnd_init_rdma()) RDMA has too many fragments for peer 192.168.134.10@o2ib27 (256), src idx/frags: 128/256 dst idx/frags: 128/256
2017-11-12 10:20:23 [763673.438365] LNetError: 6383:0:(o2iblnd_cb.c:434:kiblnd_handle_rx()) Can't setup rdma for PUT to 192.168.134.10@o2ib27: -90
2017-11-12 10:23:00 [763830.403553] Lustre: 8245:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1510510823/real 1510510823] req@ffff88102df6e300 x1583085670240648/t0(0) o4->lsh-OST0009-osc-ffff881035fb1000@172.19.3.26@o2ib600:6/4 lens 608/448 e 2 to 1 dl 1510510980 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
2017-11-12 10:23:00 [763830.435966] Lustre: lsh-OST0009-osc-ffff881035fb1000: Connection to lsh-OST0009 (at 172.19.3.26@o2ib600) was lost; in progress operations using this service will wait for recovery to complete
2017-11-12 10:23:00 [763830.455086] Lustre: Skipped 42 previous similar messages
2017-11-12 10:23:00 [763830.488005] Lustre: lsh-OST0009-osc-ffff881035fb1000: Connection restored to 172.19.3.26@o2ib600 (at 172.19.3.26@o2ib600) |
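For readers triaging similar logs, here is a minimal C sketch of the kind of fragment-limit check that produces the "RDMA has too many fragments" error above. This is an illustration only, not the actual Lustre source: the struct and function names (ib_conn, rdma_desc, conn_max_frags, rd_nfrags, init_rdma_check) are hypothetical, while the real logic lives in lnet/klnds/o2iblnd/o2iblnd_cb.c:kiblnd_init_rdma(). The sketch returns -EMSGSIZE, which is -90 on Linux and matches the rc in the "Can't setup rdma for PUT" log line.

/*
 * Minimal sketch (NOT the actual Lustre source) of the fragment-limit
 * check behind the "RDMA has too many fragments" console error above.
 * All names here are illustrative.
 */
#include <errno.h>
#include <stdio.h>

struct rdma_desc {
	int rd_nfrags;      /* memory fragments this transfer needs */
};

struct ib_conn {
	int conn_max_frags; /* max frags negotiated at connect time */
};

/*
 * Returns 0 if the transfer fits, or -EMSGSIZE (-90 on Linux, matching
 * the "Can't setup rdma for PUT ...: -90" log line) if either side
 * needs more fragments than the connection was negotiated to carry.
 */
static int init_rdma_check(const struct ib_conn *conn,
			   const struct rdma_desc *src,
			   const struct rdma_desc *dst)
{
	if (src->rd_nfrags > conn->conn_max_frags ||
	    dst->rd_nfrags > conn->conn_max_frags) {
		fprintf(stderr,
			"RDMA has too many fragments (max %d), src/dst frags: %d/%d\n",
			conn->conn_max_frags, src->rd_nfrags,
			dst->rd_nfrags);
		return -EMSGSIZE;
	}
	return 0;
}

int main(void)
{
	/* Hypothetical values echoing the failure mode in the log:
	 * the transfer needs more fragments than the connection allows. */
	struct ib_conn conn = { .conn_max_frags = 128 };
	struct rdma_desc src = { .rd_nfrags = 256 };
	struct rdma_desc dst = { .rd_nfrags = 256 };

	int rc = init_rdma_check(&conn, &src, &dst);
	printf("init_rdma_check() returned %d\n", rc); /* prints -90 */
	return 0;
}

The point of the sketch is that the error fires when a transfer needs more fragments than the two peers negotiated at connect time, which is the mismatch the requested LU-5718 backport is meant to address. |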
| Comments |
| Comment by Olaf Faaland [ 16/Nov/17 ] |
|
We see this in Lustre: Build Version: 2.8.0_8.chaos. One of our larger clusters (~2,600 clients) was moved to Lustre 2.8 recently, and we expect to see many more instances of this based on the frequency we see on a small (~128-client) test cluster. The first question is whether the patch mentioned, https://review.whamcloud.com/#/c/12451/12, is the complete fix for this, or whether other patches are required. The second question is whether this is a low-risk backport (it seems to me that it is, but you would know better). Thanks. |
| Comment by Olaf Faaland [ 16/Nov/17 ] |
|
I can't figure out how to Link this ticket to |
| Comment by Brad Hoagland (Inactive) [ 17/Nov/17 ] |
|
Hi Amir, can you help with this one? Thanks, Brad |
| Comment by Sonia Sharma (Inactive) [ 20/Nov/17 ] |
|
Here is the link to the patch for - https://review.whamcloud.com/#/c/30184/
Please note it needs to be accompanied by the fix from - |
| Comment by Joseph Gmitter (Inactive) [ 07/Dec/17 ] |
|
The patch has been backported and appears in Sonia's comment above. Is anything else needed here? |