Lustre / LU-12856

LustreError: 82937:0:(ldlm_lib.c:3268:target_bulk_io()) @@@ truncated bulk READ 0(270336)

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.14.0, Lustre 2.12.4
    • Affects Version/s: Lustre 2.12.2
    • Labels: None
    • Environment: L2.12.2 server & L2.10.8 server
      L2.12.2 client & L2.11 client
    • Severity: 2

    Description

      We are seeing

       Oct 14 07:51:06 nbp7-oss7 kernel: [1110415.124675] LNet: 72766:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 10.151.54.10@o2ib
      Oct 14 07:53:28 nbp7-oss7 kernel: [1110557.738283] LustreError: 48242:0:(ldlm_lib.c:3268:target_bulk_io()) @@@ truncated bulk READ 0(270336)  req@ffff8bd7711d9850 x1647322708401808/t0(0) o3->c1bb3af1-465f-4e4e-3b45-831f7f0aa442@10.151.53.134@o2ib:520/0 lens 488/440 e 0 to 0 dl 1571064920 ref 1 fl Interpret:/2/0 rc 0/0
      Oct 14 07:53:28 nbp7-oss7 kernel: [1110557.820414] Lustre: nbp7-OST0009: Bulk IO read error with c1bb3af1-465f-4e4e-3b45-831f7f0aa442 (at 10.151.53.134@o2ib), client will retry: rc -110
      Oct 14 07:58:40 nbp7-oss7 kernel: [1110869.753867] LNet: 72764:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 10.151.54.10@o2ib
      Oct 14 08:00:58 nbp7-oss7 kernel: [1111007.747557] LustreError: 7338:0:(ldlm_lib.c:3268:target_bulk_io()) @@@ truncated bulk READ 0(270336)  req@ffff8bd4ef6c3050 x1647322708401808/t0(0) o3->c1bb3af1-465f-4e4e-3b45-831f7f0aa442@10.151.53.134@o2ib:220/0 lens 488/440 e 0 to 0 dl 1571065375 ref 1 fl Interpret:/2/0 rc 0/0
      Oct 14 08:00:58 nbp7-oss7 kernel: [1111007.829410] Lustre: nbp7-OST0009: Bulk IO read error with c1bb3af1-465f-4e4e-3b45-831f7f0aa442 (at 10.151.53.134@o2ib), client will retry: rc -110
      Oct 14 08:02:39 nbp7-oss7 kernel: [1111108.765855] Lustre: nbp7-OST0003: Client c1bb3af1-465f-4e4e-3b45-831f7f0aa442 (at 10.151.53.134@o2ib) reconnecting
      Oct 14 08:02:39 nbp7-oss7 kernel: [1111108.800504] Lustre: Skipped 5 previous similar messages
      Oct 14 08:02:39 nbp7-oss7 kernel: [1111108.818286] Lustre: nbp7-OST0003: Connection restored to 2ee3c4e1-9fd7-3338-5bf8-d1f02bcd8a20 (at 10.151.53.134@o2ib)
      Oct 14 08:02:39 nbp7-oss7 kernel: [1111108.818288] Lustre: Skipped 5 previous similar messages
      Oct 14 08:09:47 nbp7-oss7 kernel: [1111536.849491] LNet: 72766:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 10.151.54.10@o2ib
      Oct 14 08:11:48 nbp7-oss7 kernel: [1111657.759009] LustreError: 82937:0:(ldlm_lib.c:3268:target_bulk_io()) @@@ truncated bulk READ 0(270336)  req@ffff8bd724b11850 x1647322708401808/t0(0) o3->c1bb3af1-465f-4e4e-3b45-831f7f0aa442@10.151.53.134@o2ib:132/0 lens 488/440 e 0 to 0 dl 1571066042 ref 1 fl Interpret:/2/0 rc 0/0
      Oct 14 08:11:48 nbp7-oss7 kernel: [1111657.841189] Lustre: nbp7-OST0009: Bulk IO read error with c1bb3af1-465f-4e4e-3b45-831f7f0aa442 (at 10.151.53.134@o2ib), client will retry: rc -110
      Oct 14 08:12:57 nbp7-oss7 kernel: [1111726.231189] Lustre: nbp7-OST000f: Client c1bb3af1-465f-4e4e-3b45-831f7f0aa442 (at 10.151.53.134@o2ib) reconnecting
      Oct 14 08:12:57 nbp7-oss7 kernel: [1111726.265819] Lustre: Skipped 5 previous similar messages
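
      As a side note, a minimal sketch for tallying how often each client hits this truncated bulk READ error on an OSS (assumes journalctl can see the kernel ring buffer; adjust the pattern if your syslog format differs):

      # count truncated bulk READ errors per client UUID on the OSS
      journalctl -k | grep "truncated bulk READ" | grep -o "o3->[^@]*" | sort | uniq -c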
      

      We have the 3 patches reported in LU-12385 and LU-12772 applied to both the client and the server.
      Client Configs

      r593i6n16 ~ # lnetctl global show
      global:
          numa_range: 0
          max_intf: 200
          discovery: 1
          drop_asym_route: 0
          retry_count: 0
          transaction_timeout: 100
          health_sensitivity: 0
          recovery_interval: 1
      ----
      options ko2iblnd require_privileged_port=0 use_privileged_port=0
      options ko2iblnd timeout=100 retry_count=7 peer_timeout=0 map_on_demand=32 peer_credits=32 concurrent_sends=32
      options lnet networks=o2ib(ib1)
      options lnet avoid_asym_router_failure=1 check_routers_before_use=1 small_router_buffers=65536 large_router_buffers=8192
      options lnet lnet_transaction_timeout=100
      options ptlrpc at_max=600 at_min=200
      

      Server Configs

      nbp7-oss7 ~ # lnetctl global show
      global:
          numa_range: 0
          max_intf: 200
          discovery: 1
          drop_asym_route: 0
          retry_count: 0
          transaction_timeout: 100
          health_sensitivity: 0
          recovery_interval: 1
      ----
      options ko2iblnd require_privileged_port=0 use_privileged_port=0
      options ko2iblnd ntx=251072 credits=125536 fmr_pool_size=62769 
      options ko2iblnd timeout=100 retry_count=7 peer_timeout=0 map_on_demand=32 peer_credits=32 concurrent_sends=32
      
      
      #lnet
      options lnet networks=o2ib(ib1) 
      options lnet routes="o2ib233 10.151.26.[80-94]@o2ib; o2ib313 10.151.25.[195-197,202-205,222]@o2ib 10.151.26.[60,127,140-144,150-154]@o2ib; o2ib417 10.151.26.[148,149]@o2ib 10.151.25.[167-170]@o2ib"
      options lnet dead_router_check_interval=60 live_router_check_interval=30
      options lnet avoid_asym_router_failure=1 check_routers_before_use=1 small_router_buffers=65536 large_router_buffers=8192
      options lnet lnet_transaction_timeout=100
      options ptlrpc at_max=600 at_min=200
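
      To sanity-check that these values actually took effect after module load, a minimal sketch (assuming ko2iblnd and lnet expose them as standard module parameters under /sys/module):

      # read back a few of the parameters set above; paths assume standard
      # /sys/module/<module>/parameters/<name> layout
      cat /sys/module/ko2iblnd/parameters/timeout
      cat /sys/module/ko2iblnd/parameters/peer_credits
      cat /sys/module/lnet/parameters/lnet_transaction_timeout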
      

      These bulk errors are a major issue. We thought matching the client and server at 2.12.2 would solve it, but it doesn't look like it did.

      Attachments

        1. 0001-LU-12856-debug.patch
          2 kB
        2. 0002-LU-12856-revert-LU-9983.patch
          2 kB
        3. client.10.151.53.134.debug.gz
          805 kB
        4. nbp7-mds.2019-10-16.out.gz
          65.90 MB
        5. r407i2n14.out2.gz
          50.70 MB
        6. server.xid.1647322708401808.debug.gz
          78.16 MB


          Activity

            [LU-12856] LustreError: 82937:0:(ldlm_lib.c:3268:target_bulk_io()) @@@ truncated bulk READ 0(270336)

            mhanafi Mahmoud Hanafi added a comment -

            Turning off discovery didn't help. I was able to reproduce with a 2.12.2 client and a 2.10.6 server.

            ashehata Amir Shehata (Inactive) added a comment -

            Can we try something, just to eliminate one of the variables. On the 2.12.2 client, can you disable discovery:

            options lnet lnet_peer_discovery_disabled=1

            or via lnetctl:

            lnetctl set discovery 0

            and try to reproduce with the 2.12.2 clients / 2.10.x servers.

            One of the features that made it into 2.11 and later is LNet discovery, and I want to ensure that it is not introducing an issue.
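
            A minimal verification sketch for the lnetctl route (runtime only; the lnet_peer_discovery_disabled module option above is what persists across reboots):

            lnetctl set discovery 0
            lnetctl global show | grep discovery    # expect "discovery: 0"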
            mhanafi Mahmoud Hanafi added a comment - - edited

            2.10.8 client (sles12sp3) / 2.10.6 server (cent7) - no problem
            2.10.8 client (sles12sp3) / 2.10.8 server (cent7) - no problem
            2.10.8 client (sles12sp3) / 2.12.2 server (cent7) - no problem
            2.11 client (sles12sp3)   / 2.10.6 server (cent7) - problem
            2.12.2 client (sles12sp4) / 2.10.6 server (cent7) - problem
            2.12.2 client (sles12sp4) / 2.10.8 server (cent7) - problem
            2.12.2 client (sles12sp4) / 2.12.2 server (cent7) - problem

            Correct, moving the clients past 2.10.8 is the issue.
            ashehata Amir Shehata (Inactive) added a comment - - edited

            I need to clarify the test matrix:

            2.10.8 client (sles12sp3) / 2.10.6 server (cent7) - no problem
            2.10.8 client (sles12sp3) / 2.10.8 server (cent7) - no problem
            2.10.8 client (sles12sp3) / 2.12.2 server (cent7) - no problem
            2.11 client (sles12sp3)   / 2.10.6 server (cent7) - problem
            2.12.2 client (sles12sp4) / 2.10.6 server (cent7) - problem
            2.12.2 client (sles12sp4) / 2.10.8 server (cent7) - problem
            2.12.2 client (sles12sp4) / 2.12.2 server (cent7) - problem

            Basically, once we move the clients past 2.10.8 we see the problem?

            Is that correct? That will narrow down the list of changes to check.

            thanks
            amir
            mhanafi Mahmoud Hanafi added a comment - - edited

            I tried a couple of times to reproduce using a 2.10.8 client and it wouldn't.

            So it looks like the bug was introduced after 2.10.8.

            mhanafi Mahmoud Hanafi added a comment -

            This should not have been closed.
            pjones Peter Jones added a comment -

            ok


            mhanafi Mahmoud Hanafi added a comment -

            My last test was with 2.10.6, so that would be the case reported in LU-12772. We can close that one as a duplicate and track the issue here.

            ashehata Amir Shehata (Inactive) added a comment -

            You're right that the mbits need to match. The question is why they don't for the same xid. I'll need to dig deeper, as this is not an LNet issue.

            One question though: if we go back to the original problem reported in LU-12772, can we identify that it's the same problem, i.e. that the mbits are 0x0?
            mhanafi Mahmoud Hanafi added a comment - - edited

            I got it to reproduce with a 2.11 client. Maybe it's a server-side issue. I am going to try an older server build. I know we never had this issue with 2.7 servers and 2.7 clients. I'll try a 2.10.6 server.

            I tried a 2.10.6 server and it reproduced.

            All transfers with mbits 0x0-0x0 end up with a bulk timeout.

            mhanafi Mahmoud Hanafi added a comment -

            Re-ran and got better debugging. Here, xid 1646444722512224 to nid 10.151.7.67@o2ib (r407i0n3) times out. This is the client side, after a reboot of the server:

            00000100:00080000:0.0F:1571286150.709202:0:3642:0:(client.c:1575:ptlrpc_send_new_req()) @@@ req waiting for recovery: (FULL != DISCONN)  req@ffff99fa7a75cdc0 x1646444722512224/t0(0) o3->nbptest-OST0004-osc-ffff99fa993e8000@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
            00000100:00000200:0.0F:1571286866.293058:0:3642:0:(niobuf.c:884:ptl_send_rpc()) Setup reply buffer: 512 bytes, xid 1646444722514672, portal 25
            00000100:00000200:0.0:1571286866.293067:0:3642:0:(niobuf.c:85:ptl_send_buf()) Sending 224 bytes to portal 26, xid 1646444722514672, offset 0
            00000400:00000200:0.0:1571286866.293074:0:3642:0:(lib-move.c:4608:LNetPut()) LNetPut -> 12345-10.151.27.39@o2ib
            00000400:00000200:0.0:1571286866.293083:0:3642:0:(lib-move.c:2450:lnet_handle_send_case_locked()) Source ANY to NMR:  10.151.27.39@o2ib local destination
            00000400:00000200:0.0:1571286866.293096:0:3642:0:(lib-move.c:1728:lnet_handle_send()) TRACE: 10.151.7.67@o2ib(10.151.7.67@o2ib:<?>) -> 10.151.27.39@o2ib(10.151.27.39@o2ib:10.151.27.39@o2ib) : PUT try# 0
            00000800:00000200:0.0:1571286866.293103:0:3642:0:(o2iblnd_cb.c:1647:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.151.27.39@o2ib
            00000800:00000200:0.0:1571286866.293110:0:3642:0:(o2iblnd.c:403:kiblnd_find_peer_locked()) got peer_ni [ffff9a021946ac80] -> 10.151.27.39@o2ib (2) version: 12
            00000800:00000200:0.0:1571286866.293113:0:3642:0:(o2iblnd_cb.c:1527:kiblnd_launch_tx()) conn[ffff99fbc7289a00] (68)++
            00000800:00000200:0.0:1571286866.293116:0:3642:0:(o2iblnd_cb.c:1301:kiblnd_queue_tx_locked()) conn[ffff99fbc7289a00] (69)++
            00000800:00000200:0.0:1571286866.293121:0:3642:0:(o2iblnd_cb.c:1533:kiblnd_launch_tx()) conn[ffff99fbc7289a00] (70)--
            00000100:00000200:0.0:1571286866.293338:0:3642:0:(events.c:93:reply_in_callback()) @@@ type 6, status 0  req@ffff99f8119524c0 x1646444722514672/t0(0) o400->MGC10.151.27.39@o2ib@10.151.27.39@o2ib:26/25 lens 224/224 e 0 to 0 dl 1571287321 ref 1 fl Rpc:RN/0/ffffffff rc 0/-1
            00000100:00000200:0.0:1571286866.293350:0:3642:0:(events.c:114:reply_in_callback()) @@@ unlink  req@ffff99f8119524c0 x1646444722514672/t0(0) o400->MGC10.151.27.39@o2ib@10.151.27.39@o2ib:26/25 lens 224/224 e 0 to 0 dl 1571287321 ref 1 fl Rpc:RN/0/ffffffff rc 0/-1
            00000400:00000200:0.0:1571286866.293360:0:3642:0:(lib-md.c:69:lnet_md_unlink()) Unlinking md ffff99fa69f23ee0
            00000100:00080000:11.0:1571286883.004713:0:3637:0:(recover.c:223:ptlrpc_wake_delayed()) @@@ waking (set ffff99fa992b4c00):  req@ffff99fa7a75cdc0 x1646444722512224/t0(0) o3->nbptest-OST0004-osc-ffff99fa993e8000@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
            00000100:00000200:0.0F:1571286883.004777:0:3642:0:(niobuf.c:884:ptl_send_rpc()) Setup reply buffer: 8192 bytes, xid 1646444722512224, portal 4
            00000100:00000200:0.0:1571286883.004785:0:3642:0:(niobuf.c:85:ptl_send_buf()) Sending 488 bytes to portal 6, xid 1646444722512224, offset 0
            00000400:00000200:0.0:1571286883.004793:0:3642:0:(lib-move.c:4608:LNetPut()) LNetPut -> 12345-10.151.27.39@o2ib
            00000400:00000200:0.0:1571286883.004798:0:3642:0:(lib-move.c:2450:lnet_handle_send_case_locked()) Source ANY to NMR:  10.151.27.39@o2ib local destination
            00000400:00000200:0.0:1571286883.004809:0:3642:0:(lib-move.c:1728:lnet_handle_send()) TRACE: 10.151.7.67@o2ib(10.151.7.67@o2ib:<?>) -> 10.151.27.39@o2ib(10.151.27.39@o2ib:10.151.27.39@o2ib) : PUT try# 0
            00000800:00000200:0.0:1571286883.004815:0:3642:0:(o2iblnd_cb.c:1647:kiblnd_send()) sending 488 bytes in 1 frags to 12345-10.151.27.39@o2ib
            00000800:00000200:0.0:1571286883.004830:0:3642:0:(o2iblnd.c:403:kiblnd_find_peer_locked()) got peer_ni [ffff9a021946ac80] -> 10.151.27.39@o2ib (2) version: 12
            00000800:00000200:0.0:1571286883.004832:0:3642:0:(o2iblnd_cb.c:1527:kiblnd_launch_tx()) conn[ffff99fbc7289a00] (68)++
            00000800:00000200:0.0:1571286883.004835:0:3642:0:(o2iblnd_cb.c:1301:kiblnd_queue_tx_locked()) conn[ffff99fbc7289a00] (69)++
            00000800:00000200:0.0:1571286883.004839:0:3642:0:(o2iblnd_cb.c:1533:kiblnd_launch_tx()) conn[ffff99fbc7289a00] (70)--
            00000100:00000200:37.0:1571286883.004870:0:3629:0:(events.c:57:request_out_callback()) @@@ type 5, status 0  req@ffff99fa7a75cdc0 x1646444722512224/t0(0) o3->nbptest-OST0004-osc-ffff99fa993e8000@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 1571287337 ref 3 fl Rpc:/0/ffffffff rc 0/-1
            00000100:00000200:1.0:1571286883.086207:0:3642:0:(niobuf.c:429:ptlrpc_register_bulk()) Setup 1 bulk put-sink buffers: 1 pages 4096 bytes, mbits x0x5d96ec8d403b0-0x5d96ec8d403b0, portal 8
            

            Here is the server side:

            00000100:00000200:1.0:1571286883.030498:0:10906:0:(service.c:2094:ptlrpc_server_handle_request()) got req 1646444722512224
            00000400:00000200:3.0:1571286883.030956:0:10906:0:(lib-move.c:2796:LNetPut()) LNetPut -> 12345-10.151.7.67@o2ib
            00000400:00000200:3.0:1571286883.030964:0:10906:0:(lib-move.c:1930:lnet_select_pathway()) TRACE: 10.151.27.39@o2ib(10.151.27.39@o2ib:10.151.27.39@o2ib) -> 10.151.7.67@o2ib(10.151.7.67@o2ib:10.151.7.67@o2ib) : PUT
            00000800:00000200:3.0:1571286883.030968:0:10906:0:(o2iblnd_cb.c:1510:kiblnd_send()) sending 4096 bytes in 1 frags to 12345-10.151.7.67@o2ib
            00000800:00000200:3.0:1571286883.030970:0:10906:0:(o2iblnd_cb.c:703:kiblnd_setup_rd_kiov()) niov 1 offset 0 nob 4096
            00000800:00000200:3.0:1571286883.030973:0:10906:0:(o2iblnd.c:405:kiblnd_find_peer_locked()) got peer_ni [ffff8c3c04acc300] -> 10.151.7.67@o2ib (2) version: 12
            00000800:00000200:3.0:1571286883.030975:0:10906:0:(o2iblnd_cb.c:1391:kiblnd_launch_tx()) conn[ffff8c3ba8064600] (68)++
            00000800:00000200:3.0:1571286883.030976:0:10906:0:(o2iblnd_cb.c:1166:kiblnd_queue_tx_locked()) conn[ffff8c3ba8064600] (69)++
            00000800:00000200:3.0:1571286883.030979:0:10906:0:(o2iblnd_cb.c:1397:kiblnd_launch_tx()) conn[ffff8c3ba8064600] (70)--
            00000100:00000200:3.0:1571286883.030982:0:10906:0:(niobuf.c:262:ptlrpc_start_bulk_transfer()) Transferring 1 pages 4096 bytes via portal 8 id 12345-10.151.7.67@o2ib mbits 0x0-0x0
            00010000:00020000:13.0:1571286983.030947:0:10906:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk READ after 100+0s  req@ffff8c3b6e3ba850 x1646444722512224/t0(0) o3->61da07f7-88f2-b96a-5d40-8a120a8a2bcb@10.151.7.67@o2ib:12/0 lens 488/432 e 0 to 0 dl 1571287137 ref 1 fl Interpret:/0/0 rc 0/0
            

            I don't understand this. The client set up its buffers like this:

             put-sink buffers: 1 pages 4096 bytes, mbits x0x5d96ec8d403b0-0x5d96ec8d403b0, portal 8
            

            But the server sent:

            Transferring 1 pages 4096 bytes via portal 8 id 12345-10.151.7.67@o2ib mbits 0x0-0x0
            

            Don't the mbits have to match?
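
            A quick way to line the two sides up from collected debug logs, as a minimal sketch (client.debug and server.debug are hypothetical names for lctl dk output from each node):

            # client side: what match bits were registered for the bulk sink
            grep "ptlrpc_register_bulk"       client.debug | grep -o "mbits [^,]*"
            # server side: what match bits were used to start the bulk transfer
            grep "ptlrpc_start_bulk_transfer" server.debug | grep -o "mbits .*"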

            Btw, my test is doing 4k IO.
            I uploaded all logs to ftp.whamcloud.com/uploads/LU-12856/nasa.debug.logs.tgz
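
            For reference, a minimal reproducer sketch in the same spirit (hypothetical file path; reads a Lustre-backed file with 4k direct IO, matching the bulk READ pattern above, and is not necessarily the actual test used here):

            # /mnt/nbp7/testfile is a hypothetical path on the affected filesystem
            dd if=/mnt/nbp7/testfile of=/dev/null bs=4k count=100000 iflag=direct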


            People

              Assignee: tappro Mikhail Pershin
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 12
