[LU-7385] Bulk IO write error Created: 04/Nov/15  Updated: 17/Apr/17  Resolved: 17/Apr/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: alyona romanenko (Inactive) Assignee: Doug Oucharek (Inactive)
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-5718 RDMA too fragmented with router Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

Comments
Comment by alyona romanenko (Inactive) [ 04/Nov/15 ]

The issue reported by our customer is that bulk writes from a client node (or nodes) repeatedly fail, resulting in a dropped connection.
The connection is restored, the bulk write is attempted again, and it fails again.
Ultimately the filesystem stops responding to the client node.

> Jun 10 06:54:15 snx11026n010 kernel: LustreError: 106837:0:(events.c:393:server_bulk_callback()) event type 2, status -90, desc ffff88023c269000
> Jun 10 06:54:15 snx11026n010 kernel: LustreError: 24204:0:(ldlm_lib.c:2953:target_bulk_io()) @@@ network error on bulk GET 0(527648)  req@ffff8802da0f6050 x1501695686238744/t0(0) o4->028fa37e-9825-10ed-52ee-9971d416f647@532@gni:0/0 lens 488/416 e 0 to 0 dl 1433912110 ref 1 fl Interpret:/0/0 rc 0/0
> Jun 10 06:54:15 snx11026n010 kernel: Lustre: snx11026-OST001b: Bulk IO write error with 028fa37e-9825-10ed-52ee-9971d416f647 (at 532@gni), client will retry: rc -110
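For reference, the return codes in these logs are standard Linux errno values: status -90 is -EMSGSIZE and rc -110 is -ETIMEDOUT. A trivial C check, just to confirm the numbers:

#include <errno.h>
#include <stdio.h>

int main(void)
{
	/* Linux errno values matching the logs above */
	printf("EMSGSIZE  = %d\n", EMSGSIZE);	/* 90:  bulk callback status -90 */
	printf("ETIMEDOUT = %d\n", ETIMEDOUT);	/* 110: "client will retry: rc -110" */
	return 0;
}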
Comment by alyona romanenko (Inactive) [ 05/Nov/15 ]

Hi all,
Test setup for reproducing the bug:
client1 (sjsc-34) - router (sich-33) - client2 (pink05).
Lustre version 2.5.1 + patch http://review.whamcloud.com/#/c/12496/ (LU-5718 lnet: add offset for selftest brw).
Also, session_features was set to LST_FEATS_MASK.
The script:
# create a session and a group on each side of the router
lst new_session tst
lst add_group pink05 172.18.56.133@o2ib
lst add_group sjsc34 172.24.62.76@o2ib1
# bulk-write test: 1000 KiB transfers at a 64-byte offset, looped 5 times
lst add_batch test
lst add_test --batch test --loop 5 --to pink05 --from sjsc34 brw write size=1000k off=64
lst run test
# collect stats for 5 seconds, then end the session
lst stat pink05 & /bin/sleep 5; kill $!
lst end_session

The test consistently reproduces error -90:
router: kiblnd_init_rdma()) RDMA too fragmented for 172.24.62.74@o2ib1 (256): 128/251 src 128/250 dst frags
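As a back-of-the-envelope check of those counts (assuming 4 KiB pages; the helper below is illustrative, not LNet code): a 1000 KiB buffer at offset 0 spans exactly 250 pages, while the same buffer at offset 64 straddles 251, matching the 250 dst / 251 src fragment counts in the message. Because the two sides are misaligned, the router needs an RDMA work request per overlapping fragment pair, which can exceed the 256-fragment limit shown in parentheses.

#include <stdio.h>

#define PG 4096UL /* assuming 4 KiB pages */

/* Pages spanned by a 'len'-byte buffer whose first byte sits at page
 * offset 'off'. Illustrative helper only, not LNet code. */
static unsigned long nfrags(unsigned long off, unsigned long len)
{
	return (off % PG + len + PG - 1) / PG;
}

int main(void)
{
	unsigned long len = 1000 * 1024; /* size=1000k from the lst test */

	printf("off=0:  %lu frags\n", nfrags(0, len));  /* 250 -> dst */
	printf("off=64: %lu frags\n", nfrags(64, len)); /* 251 -> src */
	return 0;
}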

thanks,
Alyona

Comment by alyona romanenko (Inactive) [ 05/Nov/15 ]

The patch is at http://review.whamcloud.com/#/c/16141/.
The issue was not reproduced with different sizes and offsets when the patch is applied.

Comment by Doug Oucharek (Inactive) [ 15/Jan/16 ]

I updated the patch to include Andreas's suggestions and also fixed a problem with socklnd. The patch worked for o2iblnd but broke socklnd whenever the message size is greater than 16K. When the message is large, socklnd wants to use zero-copy to send it. Zero-copy uses tcp_sendpage(), which will crash if the kiov passed to it is bigger than PAGE_SIZE. My fix is to avoid zero-copy when the kiov is larger than PAGE_SIZE.
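A minimal sketch of that guard, assuming the lnet_kiov_t layout from the 2.x tree (kiov_page/kiov_offset/kiov_len); the helper name is hypothetical and this is not the actual patch:

/* Kernel-style sketch: decide whether a kiov can take the zero-copy
 * (tcp_sendpage) path. tcp_sendpage() handles a single struct page, so
 * any fragment extending past one page must fall back to the regular
 * copying send. Returns 1 if zero-copy is safe, 0 otherwise. */
static int ksocknal_kiov_fits_sendpage(const lnet_kiov_t *kiov, int niov)
{
	int i;

	for (i = 0; i < niov; i++) {
		if (kiov[i].kiov_offset + kiov[i].kiov_len > PAGE_SIZE)
			return 0; /* fragment spans more than one page */
	}
	return 1;
}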

Comment by Doug Oucharek (Inactive) [ 15/Jan/16 ]

James: Would you be able to verify that kiovs bigger than PAGE_SIZE do not break gnilnd?

Comment by Doug Oucharek (Inactive) [ 01/Feb/16 ]

I am still working on a better patch for this issue, but have come to ask a more fundamental question: how is this situation happening with Lustre itself (rather than with a modified lnet_selftest)? How does an offset of 64 arise? Is it due to a partial I/O write? Lustre developers have told me that should not be happening.

Comment by Doug Oucharek (Inactive) [ 02/Nov/16 ]

After playing around with the lnet_selftest change to reproduce this issue (using an offset of 64), I have found that this issue is not specific to LNet routers: a client sending directly to a server will also fail. So the LNet router-specific fix only addresses part of the problem. The fix for LU-5718, when applied to all nodes and activated, addresses the entire problem, making it the better solution.

Comment by Peter Jones [ 17/Apr/17 ]

If I understand correctly, this is believed to be a duplicate of LU-5718.
