Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical

        Activity

          [LU-7385] Bulk IO write error
          pjones Peter Jones added a comment -

          If I understand correctly, this is believed to be a duplicate of LU-5718.


          doug Doug Oucharek (Inactive) added a comment -

          After playing around with the lnet-selftest change to reproduce this issue (using an offset of 64), I have found this issue is not specific to LNet routers. A client sending directly to a server will also fail. So this LNet router-specific fix will only address part of the problem. The fix for LU-5718, when applied to all nodes and activated, will address the entire problem, making it the better solution.

          doug Doug Oucharek (Inactive) added a comment -

          I am still working on a better patch for this issue, but have come to ask a more fundamental question: how is this situation happening with Lustre (rather than modified lnet_selftest)? How is an offset of 64 happening? Is this due to a partial I/O write? Lustre developers have told me that should not be happening.

          doug Doug Oucharek (Inactive) added a comment -

          James: Would you be able to verify that having kiovs bigger than PAGE_SIZE does not break gnilnd?

          doug Doug Oucharek (Inactive) added a comment -

          I updated the patch to include Andreas's suggestions and also fixed a problem with socklnd. The patch worked for o2iblnd, but broke socklnd whenever the message size is greater than 16K. When the message is large, socklnd wants to use zero-copy to send the message. Zero-copy uses tcp_sendpage(), which will crash if the kiov passed to it is bigger than PAGE_SIZE. My fix is to avoid zero-copy when the kiov is larger than PAGE_SIZE.
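
          A minimal sketch of the idea described above, in C (hedged: the helper name and the fragment struct are illustrative stand-ins, not the actual patch or LNet types). tcp_sendpage() transmits a single page at a time, so the zero-copy path is only taken for fragments no larger than a page; bigger fragments fall back to the ordinary copying send.

          /* Illustrative only: decide whether a kiov-style fragment may take
           * the tcp_sendpage() zero-copy path, per the fix described above. */
          #include <stdbool.h>

          #define PAGE_SIZE 4096u          /* assumption: 4 KiB pages */

          struct kiov_frag {               /* stand-in for the LNet kiov descriptor */
                  unsigned int offset;     /* offset of the data in its page */
                  unsigned int length;     /* bytes in this fragment */
          };

          static bool kiov_can_zero_copy(const struct kiov_frag *frag)
          {
                  /* Skip zero-copy for any fragment larger than one page,
                   * since tcp_sendpage() only handles a single page. */
                  return frag->length <= PAGE_SIZE;
          }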

          aromanenko alyona romanenko (Inactive) added a comment -

          The patch is at http://review.whamcloud.com/#/c/16141/
          The issue was not reproduced with various sizes and offsets once the patch was applied.

          aromanenko alyona romanenko (Inactive) added a comment -

          Hi all,
          Test setup for reproducing the bug:
          client1 (sjsc-34) - router (sich-33) - client2 (pink05).
          Lustre version 2.5.1 + patch http://review.whamcloud.com/#/c/12496/ (LU-5718 lnet: add offset for selftest brw).
          Also, session_features was set to LST_FEATS_MASK.
          The script:
          lst new_session tst
          lst add_group pink05 172.18.56.133@o2ib
          lst add_group sjsc34 172.24.62.76@o2ib1
          lst add_batch test
          lst add_test --batch test --loop 5 --to pink05 --from sjsc34 brw write size=1000k off=64
          lst run test
          lst stat pink05 & /bin/sleep 5; kill $!
          lst end_session

          The test consistently reproduces error -90 on the router:
          kiblnd_init_rdma()) RDMA too fragmented for 172.24.62.74@o2ib1 (256): 128/251 src 128/250 dst frags

          thanks,
          Alyona
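
          For context on the error in the log above: -90 is -EMSGSIZE, returned by kiblnd_init_rdma() when building the RDMA would need more scatter/gather fragments than the connection has room for (256 here); the unaligned off=64 misaligns the source and destination pages and roughly doubles the fragment count. A heavily simplified, hedged sketch of that kind of guard follows (all names and numbers illustrative, not the kiblnd source):

          /* Hedged sketch, not kiblnd code: reject a transfer that would need
           * more scatter/gather fragments than the connection can carry. */
          #include <errno.h>
          #include <stdio.h>

          #define CONN_MAX_FRAGS 256              /* budget advertised at connect time */

          static int check_frag_budget(int frags_needed)
          {
                  if (frags_needed > CONN_MAX_FRAGS) {
                          fprintf(stderr, "RDMA too fragmented: need %d, max %d\n",
                                  frags_needed, CONN_MAX_FRAGS);
                          return -EMSGSIZE;       /* errno 90 -> the "status -90" above */
                  }
                  return 0;
          }

          int main(void)
          {
                  /* A ~1000k transfer split into 4 KiB pages needs ~250 fragments;
                   * with a 64-byte offset the two sides no longer line up, so the
                   * fragment count roughly doubles and blows the 256-entry budget.
                   * (Illustrative numbers only.) */
                  printf("aligned:   %d\n", check_frag_budget(250));
                  printf("unaligned: %d\n", check_frag_budget(2 * 250));
                  return 0;
          }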

          aromanenko alyona romanenko (Inactive) added a comment -

          The issue reported by our customer is that bulk writes from client node(s) repeatedly fail, resulting in a dropped connection.
          The connection is restored, the bulk write is attempted again, and again fails.
          Ultimately the filesystem stops responding to the client node.

          > Jun 10 06:54:15 snx11026n010 kernel: LustreError: 106837:0:(events.c:393:server_bulk_callback()) event type 2, status -90, desc ffff88023c269000
          > Jun 10 06:54:15 snx11026n010 kernel: LustreError: 24204:0:(ldlm_lib.c:2953:target_bulk_io()) @@@ network error on bulk GET 0(527648)  req@ffff8802da0f6050 x1501695686238744/t0(0) o4->028fa37e-9825-10ed-52ee-9971d416f647@532@gni:0/0 lens 488/416 e 0 to 0 dl 1433912110 ref 1 fl Interpret:/0/0 rc 0/0
          > Jun 10 06:54:15 snx11026n010 kernel: Lustre: snx11026-OST001b: Bulk IO write error with 028fa37e-9825-10ed-52ee-9971d416f647 (at 532@gni), client will retry: rc -110
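
          For readers decoding these logs: the negative status values are kernel errno codes. A quick userspace check (not part of the issue itself) maps them to their names; on Linux, 90 is EMSGSIZE and 110 is ETIMEDOUT.

          /* Print the error strings for the status codes seen in the logs. */
          #include <stdio.h>
          #include <string.h>

          int main(void)
          {
                  printf("-90  -> %s\n", strerror(90));    /* EMSGSIZE  */
                  printf("-110 -> %s\n", strerror(110));   /* ETIMEDOUT */
                  return 0;
          }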

          People

            Assignee: doug Doug Oucharek (Inactive)
            Reporter: aromanenko alyona romanenko (Inactive)
            Votes: 0
            Watchers: 12
