  LU-12856

LustreError: 82937:0:(ldlm_lib.c:3268:target_bulk_io()) @@@ truncated bulk READ 0(270336)

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.14.0, Lustre 2.12.4
    • Affects Version/s: Lustre 2.12.2
    • Labels: None
    • Environment: L2.12.2 server & L2.10.8 Server
      L2.12.2 client & L2.11 Client
    • Severity: 2
    • 9223372036854775807

    Description

      We are seeing

       Oct 14 07:51:06 nbp7-oss7 kernel: [1110415.124675] LNet: 72766:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 10.151.54.10@o2ib
      Oct 14 07:53:28 nbp7-oss7 kernel: [1110557.738283] LustreError: 48242:0:(ldlm_lib.c:3268:target_bulk_io()) @@@ truncated bulk READ 0(270336)  req@ffff8bd7711d9850 x1647322708401808/t0(0) o3->c1bb3af1-465f-4e4e-3b45-831f7f0aa442@10.151.53.134@o2ib:520/0 lens 488/440 e 0 to 0 dl 1571064920 ref 1 fl Interpret:/2/0 rc 0/0
      Oct 14 07:53:28 nbp7-oss7 kernel: [1110557.820414] Lustre: nbp7-OST0009: Bulk IO read error with c1bb3af1-465f-4e4e-3b45-831f7f0aa442 (at 10.151.53.134@o2ib), client will retry: rc -110
      Oct 14 07:58:40 nbp7-oss7 kernel: [1110869.753867] LNet: 72764:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 10.151.54.10@o2ib
      Oct 14 08:00:58 nbp7-oss7 kernel: [1111007.747557] LustreError: 7338:0:(ldlm_lib.c:3268:target_bulk_io()) @@@ truncated bulk READ 0(270336)  req@ffff8bd4ef6c3050 x1647322708401808/t0(0) o3->c1bb3af1-465f-4e4e-3b45-831f7f0aa442@10.151.53.134@o2ib:220/0 lens 488/440 e 0 to 0 dl 1571065375 ref 1 fl Interpret:/2/0 rc 0/0
      Oct 14 08:00:58 nbp7-oss7 kernel: [1111007.829410] Lustre: nbp7-OST0009: Bulk IO read error with c1bb3af1-465f-4e4e-3b45-831f7f0aa442 (at 10.151.53.134@o2ib), client will retry: rc -110
      Oct 14 08:02:39 nbp7-oss7 kernel: [1111108.765855] Lustre: nbp7-OST0003: Client c1bb3af1-465f-4e4e-3b45-831f7f0aa442 (at 10.151.53.134@o2ib) reconnecting
      Oct 14 08:02:39 nbp7-oss7 kernel: [1111108.800504] Lustre: Skipped 5 previous similar messages
      Oct 14 08:02:39 nbp7-oss7 kernel: [1111108.818286] Lustre: nbp7-OST0003: Connection restored to 2ee3c4e1-9fd7-3338-5bf8-d1f02bcd8a20 (at 10.151.53.134@o2ib)
      Oct 14 08:02:39 nbp7-oss7 kernel: [1111108.818288] Lustre: Skipped 5 previous similar messages
      Oct 14 08:09:47 nbp7-oss7 kernel: [1111536.849491] LNet: 72766:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 10.151.54.10@o2ib
      Oct 14 08:11:48 nbp7-oss7 kernel: [1111657.759009] LustreError: 82937:0:(ldlm_lib.c:3268:target_bulk_io()) @@@ truncated bulk READ 0(270336)  req@ffff8bd724b11850 x1647322708401808/t0(0) o3->c1bb3af1-465f-4e4e-3b45-831f7f0aa442@10.151.53.134@o2ib:132/0 lens 488/440 e 0 to 0 dl 1571066042 ref 1 fl Interpret:/2/0 rc 0/0
      Oct 14 08:11:48 nbp7-oss7 kernel: [1111657.841189] Lustre: nbp7-OST0009: Bulk IO read error with c1bb3af1-465f-4e4e-3b45-831f7f0aa442 (at 10.151.53.134@o2ib), client will retry: rc -110
      Oct 14 08:12:57 nbp7-oss7 kernel: [1111726.231189] Lustre: nbp7-OST000f: Client c1bb3af1-465f-4e4e-3b45-831f7f0aa442 (at 10.151.53.134@o2ib) reconnecting
      Oct 14 08:12:57 nbp7-oss7 kernel: [1111726.265819] Lustre: Skipped 5 previous similar messages
      

      We have the 3 patches reported in LU-12385 and LU-12772 applied to both the client and the server.
      Client Configs

      r593i6n16 ~ # lnetctl global show
      global:
          numa_range: 0
          max_intf: 200
          discovery: 1
          drop_asym_route: 0
          retry_count: 0
          transaction_timeout: 100
          health_sensitivity: 0
          recovery_interval: 1
      ----
      options ko2iblnd require_privileged_port=0 use_privileged_port=0
      options ko2iblnd timeout=100 retry_count=7 peer_timeout=0 map_on_demand=32 peer_credits=32 concurrent_sends=32
      options lnet networks=o2ib(ib1)
      options lnet avoid_asym_router_failure=1 check_routers_before_use=1 small_router_buffers=65536 large_router_buffers=8192
      options lnet lnet_transaction_timeout=100
      options ptlrpc at_max=600 at_min=200
      

      Server Configs

      nbp7-oss7 ~ # lnetctl global show
      global:
          numa_range: 0
          max_intf: 200
          discovery: 1
          drop_asym_route: 0
          retry_count: 0
          transaction_timeout: 100
          health_sensitivity: 0
          recovery_interval: 1
      ----
      options ko2iblnd require_privileged_port=0 use_privileged_port=0
      options ko2iblnd ntx=251072 credits=125536 fmr_pool_size=62769 
      options ko2iblnd timeout=100 retry_count=7 peer_timeout=0 map_on_demand=32 peer_credits=32 concurrent_sends=32
      
      
      #lnet
      options lnet networks=o2ib(ib1) 
      options lnet routes="o2ib233 10.151.26.[80-94]@o2ib; o2ib313 10.151.25.[195-197,202-205,222]@o2ib 10.151.26.[60,127,140-144,150-154]@o2ib; o2ib417 10.151.26.[148,149]@o2ib 10.151.25.[167-170]@o2ib"
      options lnet dead_router_check_interval=60 live_router_check_interval=30
      options lnet avoid_asym_router_failure=1 check_routers_before_use=1 small_router_buffers=65536 large_router_buffers=8192
      options lnet lnet_transaction_timeout=100
      options ptlrpc at_max=600 at_min=200
      

      These bulk errors are a major issue. We thought matching the client and server at 2.12.2 would solve it, but it doesn't look like it did.
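
      For what it's worth, a quick way to watch for these failures on an OSS and see which clients they hit is sketched below; it simply greps the kernel ring buffer for the messages shown above.

      # show the bulk-read failures from the kernel log
      dmesg -T | grep -E 'truncated bulk READ|Bulk IO read error'
      # count the bulk errors per client NID
      dmesg -T | grep 'Bulk IO read error' | grep -o 'at [0-9.]*@o2ib[0-9]*' | sort | uniq -c | sort -rn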

      Attachments

        1. 0001-LU-12856-debug.patch
          2 kB
        2. 0002-LU-12856-revert-LU-9983.patch
          2 kB
        3. client.10.151.53.134.debug.gz
          805 kB
        4. nbp7-mds.2019-10-16.out.gz
          65.90 MB
        5. r407i2n14.out2.gz
          50.70 MB
        6. server.xid.1647322708401808.debug.gz
          78.16 MB

        Issue Links

          Activity

            [LU-12856] LustreError: 82937:0:(ldlm_lib.c:3268:target_bulk_io()) @@@ truncated bulk READ 0(270336)

            mhanafi Mahmoud Hanafi added a comment -

            This should not have been closed.
            pjones Peter Jones added a comment -

            ok


            mhanafi Mahmoud Hanafi added a comment -

            My last test was with 2.10.6, so that would be the case reported in LU-12772. We can close that one as a duplicate and track this issue here.

            ashehata Amir Shehata (Inactive) added a comment -

            You're right that the mbits need to match. The question is why they don't for the same xid. I'll need to dig deeper, as this is not an LNet issue.

            One question, though: if we go back to the original problem reported in LU-12772, can we identify that it's the same problem, i.e. that the mbits are 0x0?
            mhanafi Mahmoud Hanafi added a comment - edited

            I got it to reproduce with a 2.11 client. Maybe it's a server-side issue. I am going to try an older server build; I know we never had this issue with 2.7 servers and 2.7 clients. I'll try a 2.10.6 server.

            I tried a 2.10.6 server and it reproduced.

            All transfers with mbits 0x0-0x0 end up with a bulk timeout.

             


            mhanafi Mahmoud Hanafi added a comment -

            Re-ran and got better debugging. Here xid 1646444722512224 to nid 10.151.7.67@o2ib (r407i0n3) times out. This is the client side after a reboot of the server:

            00000100:00080000:0.0F:1571286150.709202:0:3642:0:(client.c:1575:ptlrpc_send_new_req()) @@@ req waiting for recovery: (FULL != DISCONN)  req@ffff99fa7a75cdc0 x1646444722512224/t0(0) o3->nbptest-OST0004-osc-ffff99fa993e8000@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
            00000100:00000200:0.0F:1571286866.293058:0:3642:0:(niobuf.c:884:ptl_send_rpc()) Setup reply buffer: 512 bytes, xid 1646444722514672, portal 25
            00000100:00000200:0.0:1571286866.293067:0:3642:0:(niobuf.c:85:ptl_send_buf()) Sending 224 bytes to portal 26, xid 1646444722514672, offset 0
            00000400:00000200:0.0:1571286866.293074:0:3642:0:(lib-move.c:4608:LNetPut()) LNetPut -> 12345-10.151.27.39@o2ib
            00000400:00000200:0.0:1571286866.293083:0:3642:0:(lib-move.c:2450:lnet_handle_send_case_locked()) Source ANY to NMR:  10.151.27.39@o2ib local destination
            00000400:00000200:0.0:1571286866.293096:0:3642:0:(lib-move.c:1728:lnet_handle_send()) TRACE: 10.151.7.67@o2ib(10.151.7.67@o2ib:<?>) -> 10.151.27.39@o2ib(10.151.27.39@o2ib:10.151.27.39@o2ib) : PUT try# 0
            00000800:00000200:0.0:1571286866.293103:0:3642:0:(o2iblnd_cb.c:1647:kiblnd_send()) sending 224 bytes in 1 frags to 12345-10.151.27.39@o2ib
            00000800:00000200:0.0:1571286866.293110:0:3642:0:(o2iblnd.c:403:kiblnd_find_peer_locked()) got peer_ni [ffff9a021946ac80] -> 10.151.27.39@o2ib (2) version: 12
            00000800:00000200:0.0:1571286866.293113:0:3642:0:(o2iblnd_cb.c:1527:kiblnd_launch_tx()) conn[ffff99fbc7289a00] (68)++
            00000800:00000200:0.0:1571286866.293116:0:3642:0:(o2iblnd_cb.c:1301:kiblnd_queue_tx_locked()) conn[ffff99fbc7289a00] (69)++
            00000800:00000200:0.0:1571286866.293121:0:3642:0:(o2iblnd_cb.c:1533:kiblnd_launch_tx()) conn[ffff99fbc7289a00] (70)--
            00000100:00000200:0.0:1571286866.293338:0:3642:0:(events.c:93:reply_in_callback()) @@@ type 6, status 0  req@ffff99f8119524c0 x1646444722514672/t0(0) o400->MGC10.151.27.39@o2ib@10.151.27.39@o2ib:26/25 lens 224/224 e 0 to 0 dl 1571287321 ref 1 fl Rpc:RN/0/ffffffff rc 0/-1
            00000100:00000200:0.0:1571286866.293350:0:3642:0:(events.c:114:reply_in_callback()) @@@ unlink  req@ffff99f8119524c0 x1646444722514672/t0(0) o400->MGC10.151.27.39@o2ib@10.151.27.39@o2ib:26/25 lens 224/224 e 0 to 0 dl 1571287321 ref 1 fl Rpc:RN/0/ffffffff rc 0/-1
            00000400:00000200:0.0:1571286866.293360:0:3642:0:(lib-md.c:69:lnet_md_unlink()) Unlinking md ffff99fa69f23ee0
            00000100:00080000:11.0:1571286883.004713:0:3637:0:(recover.c:223:ptlrpc_wake_delayed()) @@@ waking (set ffff99fa992b4c00):  req@ffff99fa7a75cdc0 x1646444722512224/t0(0) o3->nbptest-OST0004-osc-ffff99fa993e8000@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
            00000100:00000200:0.0F:1571286883.004777:0:3642:0:(niobuf.c:884:ptl_send_rpc()) Setup reply buffer: 8192 bytes, xid 1646444722512224, portal 4
            00000100:00000200:0.0:1571286883.004785:0:3642:0:(niobuf.c:85:ptl_send_buf()) Sending 488 bytes to portal 6, xid 1646444722512224, offset 0
            00000400:00000200:0.0:1571286883.004793:0:3642:0:(lib-move.c:4608:LNetPut()) LNetPut -> 12345-10.151.27.39@o2ib
            00000400:00000200:0.0:1571286883.004798:0:3642:0:(lib-move.c:2450:lnet_handle_send_case_locked()) Source ANY to NMR:  10.151.27.39@o2ib local destination
            00000400:00000200:0.0:1571286883.004809:0:3642:0:(lib-move.c:1728:lnet_handle_send()) TRACE: 10.151.7.67@o2ib(10.151.7.67@o2ib:<?>) -> 10.151.27.39@o2ib(10.151.27.39@o2ib:10.151.27.39@o2ib) : PUT try# 0
            00000800:00000200:0.0:1571286883.004815:0:3642:0:(o2iblnd_cb.c:1647:kiblnd_send()) sending 488 bytes in 1 frags to 12345-10.151.27.39@o2ib
            00000800:00000200:0.0:1571286883.004830:0:3642:0:(o2iblnd.c:403:kiblnd_find_peer_locked()) got peer_ni [ffff9a021946ac80] -> 10.151.27.39@o2ib (2) version: 12
            00000800:00000200:0.0:1571286883.004832:0:3642:0:(o2iblnd_cb.c:1527:kiblnd_launch_tx()) conn[ffff99fbc7289a00] (68)++
            00000800:00000200:0.0:1571286883.004835:0:3642:0:(o2iblnd_cb.c:1301:kiblnd_queue_tx_locked()) conn[ffff99fbc7289a00] (69)++
            00000800:00000200:0.0:1571286883.004839:0:3642:0:(o2iblnd_cb.c:1533:kiblnd_launch_tx()) conn[ffff99fbc7289a00] (70)--
            00000100:00000200:37.0:1571286883.004870:0:3629:0:(events.c:57:request_out_callback()) @@@ type 5, status 0  req@ffff99fa7a75cdc0 x1646444722512224/t0(0) o3->nbptest-OST0004-osc-ffff99fa993e8000@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 1571287337 ref 3 fl Rpc:/0/ffffffff rc 0/-1
            00000100:00000200:1.0:1571286883.086207:0:3642:0:(niobuf.c:429:ptlrpc_register_bulk()) Setup 1 bulk put-sink buffers: 1 pages 4096 bytes, mbits x0x5d96ec8d403b0-0x5d96ec8d403b0, portal 8
            

            Here is the server side:

            00000100:00000200:1.0:1571286883.030498:0:10906:0:(service.c:2094:ptlrpc_server_handle_request()) got req 1646444722512224
            00000400:00000200:3.0:1571286883.030956:0:10906:0:(lib-move.c:2796:LNetPut()) LNetPut -> 12345-10.151.7.67@o2ib
            00000400:00000200:3.0:1571286883.030964:0:10906:0:(lib-move.c:1930:lnet_select_pathway()) TRACE: 10.151.27.39@o2ib(10.151.27.39@o2ib:10.151.27.39@o2ib) -> 10.151.7.67@o2ib(10.151.7.67@o2ib:10.151.7.67@o2ib) : PUT
            00000800:00000200:3.0:1571286883.030968:0:10906:0:(o2iblnd_cb.c:1510:kiblnd_send()) sending 4096 bytes in 1 frags to 12345-10.151.7.67@o2ib
            00000800:00000200:3.0:1571286883.030970:0:10906:0:(o2iblnd_cb.c:703:kiblnd_setup_rd_kiov()) niov 1 offset 0 nob 4096
            00000800:00000200:3.0:1571286883.030973:0:10906:0:(o2iblnd.c:405:kiblnd_find_peer_locked()) got peer_ni [ffff8c3c04acc300] -> 10.151.7.67@o2ib (2) version: 12
            00000800:00000200:3.0:1571286883.030975:0:10906:0:(o2iblnd_cb.c:1391:kiblnd_launch_tx()) conn[ffff8c3ba8064600] (68)++
            00000800:00000200:3.0:1571286883.030976:0:10906:0:(o2iblnd_cb.c:1166:kiblnd_queue_tx_locked()) conn[ffff8c3ba8064600] (69)++
            00000800:00000200:3.0:1571286883.030979:0:10906:0:(o2iblnd_cb.c:1397:kiblnd_launch_tx()) conn[ffff8c3ba8064600] (70)--
            00000100:00000200:3.0:1571286883.030982:0:10906:0:(niobuf.c:262:ptlrpc_start_bulk_transfer()) Transferring 1 pages 4096 bytes via portal 8 id 12345-10.151.7.67@o2ib mbits 0x0-0x0
            00010000:00020000:13.0:1571286983.030947:0:10906:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk READ after 100+0s  req@ffff8c3b6e3ba850 x1646444722512224/t0(0) o3->61da07f7-88f2-b96a-5d40-8a120a8a2bcb@10.151.7.67@o2ib:12/0 lens 488/432 e 0 to 0 dl 1571287137 ref 1 fl Interpret:/0/0 rc 0/0
            

            I don't understand this. The client set up its buffers like this:

             put-sink buffers: 1 pages 4096 bytes, mbits x0x5d96ec8d403b0-0x5d96ec8d403b0, portal 8
            

            But the server sent:

            Transferring 1 pages 4096 bytes via portal 8 id 12345-10.151.7.67@o2ib mbits 0x0-0x0
            

            Don't the mbits have to match?
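
            A quick way to cross-check the two values from the debug dumps is sketched below; client.dk and server.dk are placeholder names for the client and server lctl dk output, and the xid is the example from this comment. Since the mbits lines themselves don't carry the xid, the server side is correlated through the service thread (pid) that logged "got req".

            XID=1646444722512224                      # example xid from this comment
            # server: find the thread that handled the request, then its bulk-transfer line(s)
            PID=$(grep "got req $XID" server.dk | head -1 | awk -F: '{print $6}')
            grep ":${PID}:" server.dk | grep -E "got req $XID|ptlrpc_start_bulk_transfer"
            # note: this also shows transfers for other requests handled by the same thread;
            # the relevant one is the line right after "got req"
            # client: the mbits the bulk sink was registered with
            grep "ptlrpc_register_bulk" client.dk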

            Btw, my test is doing 4k IO.
            I uploaded all logs to ftp.whamcloud.com/uploads/LU-12856/nasa.debug.logs.tgz


            ashehata Amir Shehata (Inactive) added a comment -

            Is it possible to get access to the test system? Now that we have a way to reproduce, I can add extra debug output to try to track what's going on. I'm having a bit of trouble tracking down why the bulk is timing out from the logs. The first instance of the problematic xids on the server is when we get:

             
            00010000:00020000:9.0:1571241290.565033:0:10331:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk READ after 100+0s  req@ffff8d3e5cf94050 x1647564415493712/t0(0)     o3->51a7d23c-3cd9-fd1b-280d-498c636237e0@10.151.8.26@o2ib:275/0 lens 488/432 e 0 to 0 dl 1571241345 ref 1 fl Interpret:/0/0 rc 0/0
             

            But I don't see any instance of the xid before this. And the last instance that I see is:

            00000100:00000200:9.0:1571241545.717364:0:10676:0:(service.c:2094:ptlrpc_server_handle_request()) got req 1647564412455904
             

            I don't see how the bulk times out.
             
            At the LNet level it seems like there are no glaring communication issues, i.e. there are no timeouts or failures reported by LNet or the LND (at least in the logs I see). The only problem is the rejection you saw, but that's probably because the connection request came early; it was retried and the connection was established. It's not responsible for the stuck requests.
             
            One possibility is that the server gets the request and the response is queued but never sent. I cannot verify that from the logs. The reason I suspect this is messages like:

            00000100:00000040:4.0:1571241024.735760:0:6055:0:(events.c:359:request_in_callback()) Buffer complete: 63 buffers still posted
             

            There are many of these messages, and the maximum number of buffers I see is 63. It doesn't go below the 50s.

            Then there is the case of the PUT_NACK:

            00000800:00000400:9.0:1571241545.699924:0:6060:0:(o2iblnd_cb.c:401:kiblnd_handle_rx()) PUT_NACK from 10.151.7.114@o2ib
            00000800:00000400:5.0:1571241545.699930:0:6062:0:(o2iblnd_cb.c:401:kiblnd_handle_rx()) PUT_NACK from 10.151.7.114@o2ib
            00000800:00000400:15.0:1571241545.699948:0:6059:0:(o2iblnd_cb.c:401:kiblnd_handle_rx()) PUT_NACK from 10.151.7.114@o2ib
            00000800:00000400:5.0:1571241545.717462:0:6062:0:(o2iblnd_cb.c:401:kiblnd_handle_rx()) PUT_NACK from 10.151.7.114@o2ib
            00000800:00000400:5.0:1571241545.717484:0:6062:0:(o2iblnd_cb.c:401:kiblnd_handle_rx()) PUT_NACK from 10.151.7.114@o2ib
            00000800:00000400:15.0:1571241545.717490:0:6059:0:(o2iblnd_cb.c:401:kiblnd_handle_rx()) PUT_NACK from 10.151.7.114@o2ib
            00000800:00000400:9.0:1571241545.758883:0:6061:0:(o2iblnd_cb.c:401:kiblnd_handle_rx()) PUT_NACK from 10.151.7.114@o2ib
            00000800:00000400:15.0:1571241545.841772:0:6059:0:(o2iblnd_cb.c:401:kiblnd_handle_rx()) PUT_NACK from 10.151.7.114@o2ib
            

             
            They all happen at the same second the request is received from the client: 1571241545.
             
            Moreover, I sampled a subset of the xids which the client reports as timing out, and on the server they all appear to arrive at the same second, 1571241545. It's likely that an event occurred that triggered some requests to get stuck in this state. They time out and are resent close to each other.

            On the client all the xids exhibit the same behaviour: send the request, it times out after 455s, resend, repeat.
             
            These are a bunch of clues, but I haven't been able to put them together into a specific conclusion yet. It's likely I'll need a few iterations of adding extra debugging and reproducing the problem, if I'm able to get access to the test system. Let me know if that's possible.
             
            Otherwise, the first step is to capture enough server logs that we can see what happens after the server receives the problematic xid.
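
            One way to do that, as a sketch (the exact debug flags and buffer size may need adjusting for the workload; the xid is one of those seen in the timeout messages above):

            # on the OSS: add RPC and LNet tracing to the debug mask
            lctl set_param debug=+rpctrace
            lctl set_param debug=+net
            lctl set_param debug_mb=1024            # enlarge the debug buffer so entries aren't overwritten
            lctl clear                              # start from an empty buffer
            # ... reproduce the hang ...
            lctl dk /tmp/server.dk                  # dump the kernel debug buffer to a file
            grep 1647564412455904 /tmp/server.dk    # follow one problematic xid through the dump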


            mhanafi Mahmoud Hanafi added a comment -

            x1646705467974160 is from client 10.151.10.96@o2ib. That was the client that got stuck yesterday; I should have cleaned that one up before retesting.

            We are looking at r407i2n14, which has nid 10.151.7.114@o2ib, but I think the client logs start after the recovery.

            If you can't get any useful info, I can retry and capture longer logs on the client.

            This is the timeline:

            1571240785           Server reboot
            1571241024          Remount and recovery
            1571241068.466059   Recovery finished
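
            (The timestamps in the timeline and in the debug logs are seconds since the epoch; with GNU date they can be converted for easier comparison.)

            for t in 1571240785 1571241024 1571241068 1571241290; do date -d "@$t"; done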

            One of the stuck XIDs (x1647564412455904):
            SERVER LOG

            00010000:00020000:11.0:1571241290.559039:0:9679:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk READ after 100+0s  req@ffff8d3e5c84b450 x1647564412455904/t0(0) o3->ae2a485b-93c9-11e3-740d-6b2550d66f21@10.151.7.114@o2ib:275/0 lens 488/432 e 0 to 0 dl 1571241345 ref 1 fl Interpret:/0/0 rc 0/0
            00010000:00020000:15.0:1571241290.559177:0:10654:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk READ after 100+0s  req@ffff8d3e5c849c50 x1647564412456000/t0(0) o3->ae2a485b-93c9-11e3-740d-6b2550d66f21@10.151.7.114@o2ib:275/0 lens 488/432 e 0 to 0 dl 1571241345 ref 1 fl Interpret:/0/0 rc 0/0
            00010000:00020000:15.0:1571241290.581931:0:10321:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk READ after 100+0s  req@ffff8d3e5c84d050 x1647564412455712/t0(0) o3->ae2a485b-93c9-11e3-740d-6b2550d66f21@10.151.7.114@o2ib:275/0 lens 488/432 e 0 to 0 dl 1571241345 ref 1 fl Interpret:/0/0 rc 0/0
            

            (r407i2n14) Client Log

            00000100:00080000:39.0:1571241090.523097:0:2948:0:(recover.c:223:ptlrpc_wake_delayed()) @@@ waking (set ffff9b20dda1f800):  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
            00000100:00000200:29.0:1571241090.523223:0:2956:0:(niobuf.c:884:ptl_send_rpc()) Setup reply buffer: 8192 bytes, xid 1647564412455904, portal 4
            00000100:00000040:29.0:1571241090.523227:0:2956:0:(niobuf.c:905:ptl_send_rpc()) @@@ send flg=0  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 1571241545 ref 3 fl Rpc:/0/ffffffff rc 0/-1
            00000100:00000200:29.0:1571241090.523275:0:2956:0:(niobuf.c:85:ptl_send_buf()) Sending 488 bytes to portal 6, xid 1647564412455904, offset 0
            00000100:00000200:17.0:1571241090.523405:0:2940:0:(events.c:57:request_out_callback()) @@@ type 5, status 0  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 1571241545 ref 3 fl Rpc:/0/ffffffff rc 0/-1
            00000100:00000040:17.0:1571241090.523414:0:2940:0:(client.c:2545:__ptlrpc_req_finished()) @@@ refcount now 2  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 1571241545 ref 3 fl Rpc:/0/ffffffff rc 0/-1
            00000100:00000400:24.0:1571241545.699041:0:2956:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1571241090/real 1571241090]  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 1 dl 1571241545 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
            00000100:00000200:24.0:1571241545.699047:0:2956:0:(events.c:93:reply_in_callback()) @@@ type 6, status 0  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 1 dl 1571241545 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
            00000100:00000200:24.0:1571241545.699051:0:2956:0:(events.c:114:reply_in_callback()) @@@ unlink  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 1 dl 1571241545 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
            00000100:00080000:39.0:1571241545.717146:0:2948:0:(recover.c:223:ptlrpc_wake_delayed()) @@@ waking (set ffff9b20dda1f800):  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 1 dl 1571241545 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
            00000100:00000200:24.0:1571241545.717171:0:2956:0:(niobuf.c:884:ptl_send_rpc()) Setup reply buffer: 8192 bytes, xid 1647564412455904, portal 4
            00000100:00000040:24.0:1571241545.717173:0:2956:0:(niobuf.c:905:ptl_send_rpc()) @@@ send flg=2  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 1571242000 ref 3 fl Rpc:/2/ffffffff rc 0/-1
            00000100:00000200:24.0:1571241545.717179:0:2956:0:(niobuf.c:85:ptl_send_buf()) Sending 488 bytes to portal 6, xid 1647564412455904, offset 0
            00000100:00000200:34.0:1571241545.717231:0:2942:0:(events.c:57:request_out_callback()) @@@ type 5, status 0  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 1571242000 ref 3 fl Rpc:/2/ffffffff rc 0/-1
            00000100:00000040:34.0:1571241545.717235:0:2942:0:(client.c:2545:__ptlrpc_req_finished()) @@@ refcount now 2  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 1571242000 ref 3 fl Rpc:/2/ffffffff rc 0/-1
            00000100:00000400:29.0:1571242000.355038:0:2956:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1571241545/real 1571241545]  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 1 dl 1571242000 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
            00000100:00080000:39.0:1571242000.355521:0:2948:0:(client.c:2816:ptlrpc_resend_req()) @@@ going to resend  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 1 dl 1571242000 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
            00000100:00000200:29.0:1571242000.384628:0:2956:0:(events.c:93:reply_in_callback()) @@@ type 6, status 0  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 1571242000 ref 2 fl Rpc:S/2/ffffffff rc -11/-1
            00000100:00000200:29.0:1571242000.384632:0:2956:0:(events.c:114:reply_in_callback()) @@@ unlink  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 1571242000 ref 2 fl Rpc:S/2/ffffffff rc -11/-1
            00000100:00000200:29.0:1571242000.384646:0:2956:0:(niobuf.c:884:ptl_send_rpc()) Setup reply buffer: 8192 bytes, xid 1647564412455904, portal 4
            00000100:00000040:29.0:1571242000.384648:0:2956:0:(niobuf.c:905:ptl_send_rpc()) @@@ send flg=2  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 1571242455 ref 3 fl Rpc:/2/ffffffff rc -11/-1
            00000100:00000200:29.0:1571242000.384653:0:2956:0:(niobuf.c:85:ptl_send_buf()) Sending 488 bytes to portal 6, xid 1647564412455904, offset 0
            00000100:00000200:34.0:1571242000.384688:0:2942:0:(events.c:57:request_out_callback()) @@@ type 5, status 0  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 1571242455 ref 3 fl Rpc:/2/ffffffff rc -11/-1
            00000100:00000040:34.0:1571242000.384692:0:2942:0:(client.c:2545:__ptlrpc_req_finished()) @@@ refcount now 2  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 1571242455 ref 3 fl Rpc:/2/ffffffff rc -11/-1
            00000100:00000400:8.0:1571242455.015047:0:2956:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1571242000/real 1571242000]  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 1 dl 1571242455 ref 2 fl Rpc:X/2/ffffffff rc -11/-1
            00000100:00000200:8.0:1571242455.015105:0:2956:0:(events.c:93:reply_in_callback()) @@@ type 6, status 0  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 1 dl 1571242455 ref 2 fl Rpc:X/2/ffffffff rc -11/-1
            00000100:00000200:8.0:1571242455.015118:0:2956:0:(events.c:114:reply_in_callback()) @@@ unlink  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 1 dl 1571242455 ref 2 fl Rpc:X/2/ffffffff rc -11/-1
            00000100:00080000:39.0:1571242455.074990:0:2948:0:(recover.c:223:ptlrpc_wake_delayed()) @@@ waking (set ffff9b20dda1f800):  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 1 dl 1571242455 ref 2 fl Rpc:X/2/ffffffff rc -11/-1
            00000100:00000200:8.0:1571242455.075055:0:2956:0:(niobuf.c:884:ptl_send_rpc()) Setup reply buffer: 8192 bytes, xid 1647564412455904, portal 4
            00000100:00000040:8.0:1571242455.075063:0:2956:0:(niobuf.c:905:ptl_send_rpc()) @@@ send flg=2  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 1571242910 ref 3 fl Rpc:/2/ffffffff rc -11/-1
            00000100:00000200:8.0:1571242455.075092:0:2956:0:(niobuf.c:85:ptl_send_buf()) Sending 488 bytes to portal 6, xid 1647564412455904, offset 0
            00000100:00000200:14.0:1571242455.075206:0:2941:0:(events.c:57:request_out_callback()) @@@ type 5, status 0  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 1571242910 ref 3 fl Rpc:/2/ffffffff rc -11/-1
            00000100:00000040:14.0:1571242455.075215:0:2941:0:(client.c:2545:__ptlrpc_req_finished()) @@@ refcount now 2  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 0 dl 1571242910 ref 3 fl Rpc:/2/ffffffff rc -11/-1
            
            

            ashehata Amir Shehata (Inactive) added a comment -

            I'm seeing the following in the server logs:

            6794493 00000020:00000040:7.0:1571241545.526189:0:10654:0:(genops.c:884:class_conn2export()) looking for export cookie 0x438c0068703e8998
            6794494 00000020:00000040:7.0:1571241545.526191:0:10654:0:(genops.c:971:class_export_get()) GETting export ffff8d363c5db000 : new refcount 6
            6794495 00010000:00000040:7.0:1571241545.526203:0:10654:0:(ldlm_resource.c:1487:ldlm_resource_getref()) getref res: ffff8d3f1dd94300 count: 2
            6794496 00010000:00000040:7.0:1571241545.526206:0:10654:0:(ldlm_resource.c:1526:ldlm_resource_putref()) putref res: ffff8d3f1dd94300 count: 1
            6794497 00000100:00000040:7.0:1571241545.526215:0:10654:0:(service.c:1840:ptlrpc_server_request_get()) RPC GETting export ffff8d363c5db000 : new rpc_count 1
            6794498 00000100:00000040:7.0:1571241545.526219:0:10654:0:(lustre_net.h:2481:ptlrpc_rqphase_move()) @@@ move req "New" -> "Interpret"  req@ffff8d3e5c84ac50 x1646705467974160/        t0(0) o4->b9e645b5-be32-890c-4f1d-ef22af2686a1@10.151.10.96@o2ib:730/0 lens 4584/0 e 0 to 0 dl 1571241800 ref 1 fl New:/2/ffffffff rc 0/-1
            6794499 00000100:00000200:7.0:1571241545.526228:0:10654:0:(service.c:2094:ptlrpc_server_handle_request()) got req 1646705467974160
            6794500 00000020:00000040:7.0:1571241545.526254:0:10654:0:(genops.c:971:class_export_get()) GETting export ffff8d363c5db000 : new refcount 7
            6794501 00000400:00000200:7.0:1571241545.526261:0:10654:0:(lib-move.c:2998:LNetGet()) LNetGet -> 12345-10.151.10.96@o2ib
            6794502 00000400:00000200:7.0:1571241545.526273:0:10654:0:(lib-move.c:1930:lnet_select_pathway()) TRACE: 10.151.27.39@o2ib(10.151.27.39@o2ib:10.151.27.39@o2ib) -> 10.151.10.9        6@o2ib(10.151.10.96@o2ib:10.151.10.96@o2ib) : GET
            6794503 00000800:00000200:7.0:1571241545.526278:0:10654:0:(o2iblnd_cb.c:1510:kiblnd_send()) sending 0 bytes in 0 frags to 12345-10.151.10.96@o2ib
            6794504 00000800:00000200:7.0:1571241545.526281:0:10654:0:(o2iblnd_cb.c:703:kiblnd_setup_rd_kiov()) niov 1 offset 0 nob 4096
            6794505 00000400:00000200:7.0:1571241545.526289:0:10654:0:(lib-move.c:2877:lnet_create_reply_msg()) 10.151.27.39@o2ib: Reply from 12345-10.151.10.96@o2ib md ffff8d3e5cf61400
            6794506 00000800:00000200:7.0:1571241545.526293:0:10654:0:(o2iblnd.c:405:kiblnd_find_peer_locked()) got peer_ni [ffff8d3f1bab2e00] -> 10.151.10.96@o2ib (2) version: 12
            6794507 00000800:00000200:7.0:1571241545.526296:0:10654:0:(o2iblnd_cb.c:1391:kiblnd_launch_tx()) conn[ffff8d3efb3dba00] (68)++
            

            Later on (based on the connection ID), it appears that the client is responding with -ENODATA:

            6794519 00000800:00000200:3.0:1571241545.526389:0:6062:0:(o2iblnd_cb.c:74:kiblnd_tx_done()) conn[ffff8d3efb3dba00] (71)--
            6794520 00000100:00020000:3.0:1571241545.526392:0:6062:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8d363c04aa00
            6794521 00000100:00020000:3.0:1571241545.538264:0:6062:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8d363c04aa00
            6794522 00000400:00000200:3.0:1571241545.550138:0:6062:0:(lib-md.c:69:lnet_md_unlink()) Unlinking md ffff8d3e5cf61400
            6794523 00000800:00000200:3.0:1571241545.550141:0:6062:0:(o2iblnd_cb.c:194:kiblnd_post_rx()) conn[ffff8d3efb3dba00] (69)++
            6794524 00000800:00000200:3.0:1571241545.550143:0:6062:0:(o2iblnd_cb.c:228:kiblnd_post_rx()) conn[ffff8d3efb3dba00] (70)--
            6794525 00010000:00020000:9.0:1571241545.550145:0:10654:0:(ldlm_lib.c:3255:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff8d3e5c84ac50 x1646705467974160/t0(0) o4-        >b9e645b5-be32-890c-4f1d-ef22af2686a1@10.151.10.96@o2ib:730/0 lens 4584/448 e 0 to 0 dl 1571241800 ref 1 fl Interpret:/2/0 rc 0/0
            6794526 00000800:00000200:3.0:1571241545.550145:0:6062:0:(o2iblnd_cb.c:3663:kiblnd_scheduler()) conn[ffff8d3efb3dba00] (69)--
            6794527 00000020:00000040:9.0:1571241545.577696:0:10654:0:(genops.c:981:class_export_put()) PUTting export ffff8d363c5db000 : new refcount 6
            6794528 00000020:02000400:9.0:1571241545.577703:0:10654:0:(tgt_handler.c:2347:tgt_brw_write()) nbptest-OST0004: Bulk IO write error with b9e645b5-be32-890c-4f1d-ef22af2686a1         (at 10.151.10.96@o2ib), client will retry: rc = -110
            6794529 00010000:00000080:9.0:1571241545.592859:0:10654:0:(ldlm_lib.c:2883:target_committed_to_req()) @@@ not sending last_committed update (0/1)  req@ffff8d3e5c84ac50 x16467        05467974160/t0(0) o4->b9e645b5-be32-890c-4f1d-ef22af2686a1@10.151.10.96@o2ib:730/0 lens 4584/448 e 0 to 0 dl 1571241800 ref 1 fl Interpret:/2/ffffffff rc -110/-1
            6794530 00010000:00000040:9.0:1571241545.592868:0:10654:0:(ldlm_lib.c:2886:target_committed_to_req()) last_committed 17179869458, transno 0, xid 1646705467974160
            6794531 00000100:00000040:9.0:1571241545.592871:0:10654:0:(lustre_net.h:2481:ptlrpc_rqphase_move()) @@@ move req "Interpret" -> "Complete"  req@ffff8d3e5c84ac50 x164670546797        4160/t0(0) o4->b9e645b5-be32-890c-4f1d-ef22af2686a1@10.151.10.96@o2ib:730/0 lens 4584/448 e 0 to 0 dl 1571241800 ref 1 fl Interpret:/2/ffffffff rc -110/-1
            6794532 00000100:00000040:9.0:1571241545.592893:0:10654:0:(service.c:1034:ptlrpc_server_finish_active_request()) RPC PUTting export ffff8d363c5db000 : new rpc_count 0
            6794533 00010000:00000040:9.0:1571241545.592897:0:10654:0:(ldlm_resource.c:1487:ldlm_resource_getref()) getref res: ffff8d3f1dd94300 count: 2
            6794534 00010000:00000040:9.0:1571241545.592900:0:10654:0:(ldlm_resource.c:1526:ldlm_resource_putref()) putref res: ffff8d3f1dd94300 count: 1
            6794535 00000020:00000040:9.0:1571241545.592903:0:10654:0:(genops.c:981:class_export_put()) PUTting export ffff8d363c5db000 : new refcount 5
            6794536 00000100:00000200:9.0:1571241545.592906:0:10654:0:(niobuf.c:953:ptlrpc_register_rqbd()) LNetMEAttach: portal 6
            

            The server reports:

            (tgt_handler.c:2347:tgt_brw_write()) nbptest-OST0004: Bulk IO write error with b9e645b5-be32-890c-4f1d-ef22af2686a1         (at 10.151.10.96@o2ib), client will retry: rc = -110
            (ldlm_lib.c:2883:target_committed_to_req()) @@@ not sending last_committed update 

            This happens in the following scenario:

            Server --GET (MD cookie) --> client
            Client tries to look up the MD but can't find it
            Server <--GET_DONE (-ENODATA) -- client
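
            In a decoded OSS debug log this pattern shows up as an LNetGet toward the client, followed by server_bulk_callback events with status -61 and the target_bulk_io() "network error on bulk WRITE" message. A rough way to pull those lines out, assuming the decoded server log has been saved to a file (the file name below is only a placeholder):

            # oss-debug.log is a placeholder name for the decoded server-side lctl debug log
            grep -nE 'LNetGet|server_bulk_callback.*status -61|network error on bulk WRITE' oss-debug.log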

            Did the server reboot happen around Wed Oct 16 08:59:05 PDT 2019?

            However, the xid in this log snippet (x1646705467974160) is not to or from the client that the attached client logs belong to; it is to/from 10.151.10.96@o2ib. Can we confirm whether that client is hung on this request?

            Do you know which particular xid is stuck in the logs provided?
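
            One way to check from the attached dumps is to count how often a candidate xid appears and then look at its last few occurrences; for example, with the xid quoted above and the compressed attachments referenced later in this ticket:

            xid=1646705467974160
            zgrep -c "x${xid}" r407i2n14.out2.gz nbp7-mds.2019-10-16.out.gz
            zgrep "x${xid}" nbp7-mds.2019-10-16.out.gz | tail -5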

            ashehata Amir Shehata (Inactive) added a comment

            I'm looking at the logs. In the meantime, can we verify that the problem doesn't occur if you use a 2.10.8 client?

            ashehata Amir Shehata (Inactive) added a comment

            I tried L2.12.2 with MOFED 4.5 and got the same error. I was able to get a debug dump from the client side.
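
            For reference, a typical way such a client-side dump is captured (the exact debug mask used for this run is not recorded here, so the commands below are only a sketch):

            lctl set_param debug=+net                    # add LNet message tracing to the debug mask
            lctl clear                                   # start from an empty kernel debug buffer
            # ... reproduce the hang ...
            lctl dk > /tmp/$(hostname).out && gzip /tmp/$(hostname).out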

            I killed my job and it left files open.

            r407i2n14 ~ # lsof /nobackuptest/
            COMMAND  PID    USER   FD   TYPE     DEVICE SIZE/OFF               NODE NAME
            IOR     3965 mhanafi  cwd    DIR 523,627074    16384 144115272381366273 /nobackuptest/mhanafi/testdir
            IOR     3965 mhanafi   11u   REG 523,627074 31457280 144115272398143530 /nobackuptest/mhanafi/testdir/testFile.00000031
            IOR     3966 mhanafi  cwd    DIR 523,627074    16384 144115272381366273 /nobackuptest/mhanafi/testdir
            IOR     3966 mhanafi   11u   REG 523,627074 31457280 144115272398143535 /nobackuptest/mhanafi/testdir/testFile.00000032
            IOR     3968 mhanafi  cwd    DIR 523,627074    16384 144115272381366273 /nobackuptest/mhanafi/testdir
            IOR     3968 mhanafi   11u   REG 523,627074 31457280 144115272398143537 /nobackuptest/mhanafi/testdir/testFile.00000034
            IOR     3969 mhanafi  cwd    DIR 523,627074    16384 144115272381366273 /nobackuptest/mhanafi/testdir
            IOR     3969 mhanafi   11u   REG 523,627074 31457280 144115272398143531 /nobackuptest/mhanafi/testdir/testFile.00000035
            IOR     3972 mhanafi  cwd    DIR 523,627074    16384 144115272381366273 /nobackuptest/mhanafi/testdir
            IOR     3972 mhanafi   11u   REG 523,627074 31457280 144115272398143533 /nobackuptest/mhanafi/testdir/testFile.00000038
            IOR     3973 mhanafi  cwd    DIR 523,627074    16384 144115272381366273 /nobackuptest/mhanafi/testdir
            IOR     3973 mhanafi   11u   REG 523,627074 31457280 144115272398143532 /nobackuptest/mhanafi/testdir/testFile.00000039
            IOR     3977 mhanafi  cwd    DIR 523,627074    16384 144115272381366273 /nobackuptest/mhanafi/testdir
            IOR     3977 mhanafi   11u   REG 523,627074 31457280 144115272398143541 /nobackuptest/mhanafi/testdir/testFile.00000133
            IOR     3978 mhanafi  cwd    DIR 523,627074    16384 144115272381366273 /nobackuptest/mhanafi/testdir
            IOR     3978 mhanafi   11u   REG 523,627074 31457280 144115272398143539 /nobackuptest/mhanafi/testdir/testFile.00000134
            

            Here are the logs from r407i2n14:

            Oct 16 08:33:01 r407i2n14 kernel: [1571239981.657626] Lustre: Mounted nbptest-client
            Oct 16 08:36:38 r407i2n14 kernel: [1571240198.107290] LNet: 3987:0:(debug.c:370:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
            Oct 16 08:39:17 r407i2n14 kernel: [1571240357.957632] LustreError: 11-0: nbptest-OST0002-osc-ffff9b1c83b70800: operation ost_read to node 10.151.27.39@o2ib failed: rc = -19
            Oct 16 08:39:17 r407i2n14 kernel: [1571240357.957632] Lustre: nbptest-OST0002-osc-ffff9b1c83b70800: Connection to nbptest-OST0002 (at 10.151.27.39@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            Oct 16 08:39:17 r407i2n14 kernel: [1571240357.957632] LustreError: Skipped 5 previous similar messages
            Oct 16 08:40:38 r407i2n14 kernel: [1571240438.739797] LNetError: 2935:0:(o2iblnd_cb.c:3335:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
            Oct 16 08:40:38 r407i2n14 kernel: [1571240438.739797] LNetError: 2935:0:(o2iblnd_cb.c:3410:kiblnd_check_conns()) Timed out RDMA with 10.151.27.39@o2ib (32): c: 28, oc: 0, rc: 32
            Oct 16 08:40:38 r407i2n14 kernel: [1571240438.787796] Lustre: 2959:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1571240433/real 1571240438]  req@ffff9b1f1f58cdc0 x1647564412471488/t0(0) o400->MGC10.151.27.39@o2ib@10.151.27.39@o2ib:26/25 lens 224/224 e 0 to 1 dl 1571240888 ref 2 fl Rpc:eXN/0/ffffffff rc 0/-1
            Oct 16 08:40:38 r407i2n14 kernel: [1571240438.787796] Lustre: nbptest-OST0005-osc-ffff9b1c83b70800: Connection to nbptest-OST0005 (at 10.151.27.39@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            Oct 16 08:40:38 r407i2n14 kernel: [1571240438.787796] Lustre: 2959:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
            Oct 16 08:40:38 r407i2n14 kernel: [1571240438.787796] Lustre: Skipped 4 previous similar messages
            Oct 16 08:40:38 r407i2n14 kernel: [1571240438.819795] LustreError: 166-1: MGC10.151.27.39@o2ib: Connection to MGS (at 10.151.27.39@o2ib) was lost; in progress operations using this service will fail
            Oct 16 08:47:15 r407i2n14 kernel: [1571240835.082847] Lustre: 2966:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1571240380/real 1571240380]  req@ffff9b1f1faf2000 x1647564412461280/t0(0) o400->nbptest-OST0000-osc-ffff9b1c83b70800@10.151.27.39@o2ib:28/4 lens 224/224 e 0 to 1 dl 1571240835 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            Oct 16 08:47:15 r407i2n14 kernel: [1571240835.082847] Lustre: 2966:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 1 previous similar message
            Oct 16 08:47:42 r407i2n14 kernel: [1571240862.726231] Lustre: 2963:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1571240407/real 1571240407]  req@ffff9b1f1f444940 x1647564412466432/t0(0) o400->nbptest-OST0000-osc-ffff9b1c83b70800@10.151.27.39@o2ib:28/4 lens 224/224 e 0 to 1 dl 1571240862 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            Oct 16 08:47:42 r407i2n14 kernel: [1571240862.726231] Lustre: 2963:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
            Oct 16 08:50:21 r407i2n14 kernel: [1571241021.438678] Lustre: Evicted from MGS (at 10.151.27.39@o2ib) after server handle changed from 0x8abce496dbe36dab to 0x438c0068703e8a86
            Oct 16 08:50:21 r407i2n14 kernel: [1571241021.438678] Lustre: MGC10.151.27.39@o2ib: Connection restored to 10.151.27.39@o2ib (at 10.151.27.39@o2ib)
            Oct 16 08:50:21 r407i2n14 kernel: [1571241021.694672] LustreError: 2948:0:(client.c:3025:ptlrpc_replay_interpret()) @@@ status 301, old was 0  req@ffff9b27bcc17200 x1647564409414608/t12884904463(12884904463) o101->nbptest-MDT0000-mdc-ffff9b1c83b70800@10.151.27.39@o2ib:12/10 lens 616/544 e 0 to 0 dl 1571241476 ref 2 fl Interpret:RP/4/0 rc 301/301
            Oct 16 08:50:25 r407i2n14 kernel: [1571241025.850579] Lustre: nbptest-MDT0000-mdc-ffff9b1c83b70800: Connection restored to 10.151.27.39@o2ib (at 10.151.27.39@o2ib)
            Oct 16 08:51:30 r407i2n14 kernel: [1571241090.517125] Lustre: nbptest-OST0005-osc-ffff9b1c83b70800: Connection restored to 10.151.27.39@o2ib (at 10.151.27.39@o2ib)
            Oct 16 08:59:05 r407i2n14 kernel: [1571241545.754914] Lustre: 2968:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1571241090/real 1571241090]  req@ffff9b20b34236c0 x1647564412456208/t0(0) o3->nbptest-OST0003-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 1 dl 1571241545 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
            Oct 16 08:59:05 r407i2n14 kernel: [1571241545.754914] Lustre: 2962:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1571241090/real 1571241090]  req@ffff9b1f3c7544c0 x1647564412456112/t0(0) o3->nbptest-OST0003-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 1 dl 1571241545 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
            Oct 16 08:59:05 r407i2n14 kernel: [1571241545.754914] Lustre: 2952:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1571241090/real 1571241090]  req@ffff9b1fc7f784c0 x1647564412455712/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 1 dl 1571241545 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
            Oct 16 08:59:05 r407i2n14 kernel: [1571241545.754914] Lustre: 2962:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
            Oct 16 08:59:05 r407i2n14 kernel: [1571241545.754914] Lustre: 2952:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
            Oct 16 08:59:05 r407i2n14 kernel: [1571241545.754914] Lustre: nbptest-OST0001-osc-ffff9b1c83b70800: Connection to nbptest-OST0001 (at 10.151.27.39@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            Oct 16 08:59:05 r407i2n14 kernel: [1571241545.754914] Lustre: Skipped 1 previous similar message
            Oct 16 08:59:05 r407i2n14 kernel: [1571241545.754914] Lustre: nbptest-OST0002-osc-ffff9b1c83b70800: Connection restored to 10.151.27.39@o2ib (at 10.151.27.39@o2ib)
            Oct 16 08:59:05 r407i2n14 kernel: [1571241545.754914] Lustre: Skipped 5 previous similar messages
            Oct 16 08:59:05 r407i2n14 kernel: [1571241545.754914] Lustre: 2968:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 1 previous similar message
            Oct 16 09:06:40 r407i2n14 kernel: [1571242000.385049] Lustre: 2956:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1571241545/real 1571241545]  req@ffff9b1f72ea3b00 x1647564412455904/t0(0) o3->nbptest-OST0001-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 1 dl 1571242000 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
            Oct 16 09:06:40 r407i2n14 kernel: [1571242000.385049] Lustre: nbptest-OST0002-osc-ffff9b1c83b70800: Connection to nbptest-OST0002 (at 10.151.27.39@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            Oct 16 09:06:40 r407i2n14 kernel: [1571242000.385049] Lustre: nbptest-OST0001-osc-ffff9b1c83b70800: Connection restored to 10.151.27.39@o2ib (at 10.151.27.39@o2ib)
            Oct 16 09:06:40 r407i2n14 kernel: [1571242000.385049] Lustre: Skipped 2 previous similar messages
            Oct 16 09:06:40 r407i2n14 kernel: [1571242000.385049] Lustre: 2956:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
            Oct 16 09:06:40 r407i2n14 kernel: [1571242000.413048] Lustre: Skipped 4 previous similar messages
            Oct 16 09:14:15 r407i2n14 kernel: [1571242455.038902] Lustre: 2950:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1571242000/real 1571242000]  req@ffff9b1fc7f3f200 x1647564412454624/t0(0) o3->nbptest-OST0002-osc-ffff9b1c83b70800@10.151.27.39@o2ib:6/4 lens 488/4536 e 0 to 1 dl 1571242455 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
            Oct 16 09:14:15 r407i2n14 kernel: [1571242455.038902] Lustre: nbptest-OST0003-osc-ffff9b1c83b70800: Connection to nbptest-OST0003 (at 10.151.27.39@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            Oct 16 09:14:15 r407i2n14 kernel: [1571242455.038902] Lustre: nbptest-OST0002-osc-ffff9b1c83b70800: Connection restored to 10.151.27.39@o2ib (at 10.151.27.39@o2ib)
            Oct 16 09:14:15 r407i2n14 kernel: [1571242455.038902] Lustre: Skipped 2 previous similar messages
            Oct 16 09:14:15 r407i2n14 kernel: [1571242455.038902] Lustre: 2950:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
            Oct 16 09:14:15 r407i2n14 kernel: [1571242455.070902] Lustre: Skipped 2 previous similar messages
            

            Here are the debug logs from the client side:
            r407i2n14.out2.gz

            Server-side logs:

            nbp7-mds.2019-10-16.out.gz

            mhanafi Mahmoud Hanafi added a comment

            People

              Assignee: tappro Mikhail Pershin
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 12
