[LU-7385] Bulk IO write error Created: 04/Nov/15 Updated: 17/Apr/17 Resolved: 17/Apr/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | alyona romanenko (Inactive) | Assignee: | Doug Oucharek (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Comments |
| Comment by alyona romanenko (Inactive) [ 04/Nov/15 ] |
|
The issue reported by our customer is that bulk writes from client node(s) repeatedly fail, resulting in a dropped connection.
> Jun 10 06:54:15 snx11026n010 kernel: LustreError: 106837:0:(events.c:393:server_bulk_callback()) event type 2, status -90, desc ffff88023c269000
> Jun 10 06:54:15 snx11026n010 kernel: LustreError: 24204:0:(ldlm_lib.c:2953:target_bulk_io()) @@@ network error on bulk GET 0(527648) req@ffff8802da0f6050 x1501695686238744/t0(0) o4->028fa37e-9825-10ed-52ee-9971d416f647@532@gni:0/0 lens 488/416 e 0 to 0 dl 1433912110 ref 1 fl Interpret:/0/0 rc 0/0
> Jun 10 06:54:15 snx11026n010 kernel: Lustre: snx11026-OST001b: Bulk IO write error with 028fa37e-9825-10ed-52ee-9971d416f647 (at 532@gni), client will retry: rc -110 |
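For readers decoding the numeric codes in the log excerpt above: on Linux, status -90 corresponds to EMSGSIZE ("Message too long") and rc -110 to ETIMEDOUT ("Connection timed out"). The following is a minimal sketch of that lookup. The struct and the helper `lookup_linux_errno` are hypothetical names introduced for illustration and are not part of Lustre; the values are hardcoded because errno numbering differs between operating systems.

```c
#include <stddef.h>

/* Linux errno values relevant to the log excerpt above.
 * Hardcoded on purpose: the numbering is Linux-specific, and the
 * logs in this ticket come from a Linux server. */
struct errno_name {
    int         value;   /* positive errno number */
    const char *name;    /* symbolic name */
    const char *meaning; /* short description */
};

static const struct errno_name linux_errnos[] = {
    {  90, "EMSGSIZE",  "Message too long" },
    { 110, "ETIMEDOUT", "Connection timed out" },
};

/* Hypothetical helper: accepts either sign convention (-90 or 90).
 * Returns NULL if the code is not in this small table. */
static const struct errno_name *lookup_linux_errno(int rc)
{
    int value = rc < 0 ? -rc : rc;
    size_t i;

    for (i = 0; i < sizeof(linux_errnos) / sizeof(linux_errnos[0]); i++) {
        if (linux_errnos[i].value == value)
            return &linux_errnos[i];
    }
    return NULL;
}
```

Read this way, the server-side bulk callback reports a message-size error (-90), and the write is then reported to the client as timed out (-110) so that it retries.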
| Comment by alyona romanenko (Inactive) [ 05/Nov/15 ] |
|
Hi all, the test consistently reproduces error -90: thanks, |
| Comment by alyona romanenko (Inactive) [ 05/Nov/15 ] |
|
The patch is at http://review.whamcloud.com/#/c/16141/ |
| Comment by Doug Oucharek (Inactive) [ 15/Jan/16 ] |
|
I updated the patch to include Andreas's suggestions and also fixed a problem with socklnd. The patch worked for o2iblnd, but broke socklnd whenever the message size is greater than 16K. When the message is large, socklnd wants to use zero-copy to send the message. Zero-copy uses tcp_sendpage() which will crash if the kiov passed to it is bigger than PAGE_SIZE. My fix is to avoid zero-copy when the kiov is larger than PAGE_SIZE. |
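The guard described in the comment above can be sketched roughly as follows. This is a user-space illustration only, not the actual socklnd change: `struct sketch_kiov`, `sketch_can_zero_copy`, and `SKETCH_PAGE_SIZE` are hypothetical names, and the 4096-byte page size is an assumption (the real PAGE_SIZE depends on the architecture).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size */

/* Hypothetical stand-in for a kiov fragment (not the kernel struct). */
struct sketch_kiov {
    unsigned int offset; /* byte offset into the fragment's first page */
    unsigned int len;    /* fragment length in bytes */
};

/* Returns true only if no fragment is larger than one page.
 * When this returns false, the sender would fall back to the
 * ordinary copying send path instead of zero-copy. */
static bool sketch_can_zero_copy(const struct sketch_kiov *kiov, size_t niov)
{
    size_t i;

    assert(kiov != NULL || niov == 0);

    for (i = 0; i < niov; i++) {
        if (kiov[i].len > SKETCH_PAGE_SIZE)
            return false; /* oversized fragment: avoid zero-copy */
    }
    return true;
}
```

The check is deliberately conservative: a single oversized fragment disables zero-copy for the whole message, trading some throughput for not handing a multi-page fragment to a per-page send routine.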
| Comment by Doug Oucharek (Inactive) [ 15/Jan/16 ] |
|
James: Would you be able to verify that having kiovs bigger than PAGE_SIZE does not break gnilnd? |
| Comment by Doug Oucharek (Inactive) [ 01/Feb/16 ] |
|
I am still working on a better patch for this issue, but have come to ask a more fundamental question: how is this situation happening with Lustre (rather than modified lnet_selftest)? How is an offset of 64 happening? Is this due to a partial I/O write? Lustre developers have told me that should not be happening. |
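To make the effect of a non-zero offset concrete, here is a small arithmetic sketch of how many pages a buffer touches when it starts partway into its first page. The function name `sketch_pages_spanned`, the 4096-byte page size, and the 1 MiB transfer size used in the note below are assumptions for illustration; none of them are taken from this ticket.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size */

/* Number of pages touched by a buffer of `len` bytes that begins
 * `first_page_offset` bytes into its first page. */
static size_t sketch_pages_spanned(size_t first_page_offset, size_t len)
{
    assert(first_page_offset < SKETCH_PAGE_SIZE);

    if (len == 0)
        return 0;

    /* Round (offset + len) up to a whole number of pages. */
    return (first_page_offset + len + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}
```

Under these assumptions, a 1 MiB buffer at offset 0 spans 256 pages, while the same buffer at offset 64 spans 257. If a per-message limit of 256 page fragments applied (an assumption here, not something stated in this ticket), the one extra page would be enough to exceed it.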
| Comment by Doug Oucharek (Inactive) [ 02/Nov/16 ] |
|
After playing around with the lnet-selftest change to reproduce this issue (using an offset of 64), I have found this issue is not specific to LNet routers. A client sending directly to a server will also fail. So this LNet router-specific fix will only address part of the problem. The fix for |
| Comment by Peter Jones [ 17/Apr/17 ] |
|
If I understand correctly, this is believed to be a duplicate of